An Introduction to 
Multivariate Statistical Analysis 
Third Edition 


T. W. ANDERSON 


Stanford University 
Department of Statistics 
Stanford, CA 


WILEY-INTERSCIENCE


A JOHN WILEY & SONS, INC., PUBLICATION 


Copyright © 2003 by John Wiley & Sons, Inc. All rights reserved. 


Published by John Wiley & Sons, Inc. Hoboken, New Jersey 
Published simultaneously in Canada.


No part of this publication may be reproduced, stored in a retrieval system or transmitted in any 
form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, 
except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without 
either the prior written permission of the Publisher, or authorization through payment of the 
appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, 
Danvers, MA 01923, 978-750-8400, fax 978-750-4470, or on the web at www.copyright.com. 
Requests to the Publisher for permission should be addressed to the Permissions Department, 
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 
748-6008, e-mail: permreq@wiley.com. 


Limit of Liability/ Disclaimer of Warranty: While the publisher and author have used their best 
efforts in preparing this book, they make no representations or warranties with respect to 
the accuracy or completeness of the contents of this book and specifically disclaim any 
implied warranties of merchantability or fitness for a particular purpose. No warranty may be 
created or extended by sales representatives or written sales materials. The advice and 
strategies contained herein may not be suitable for your situation. You should consult with 
a professional where appropriate. Neither the publisher nor author shall be liable for any 
loss of profit or any other commercial damages, including but not limited to special, 
incidental, consequential, or other damages. 


For general information on our other products and services please contact our Customer 
Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or 
fax 317-572-4002. 


Wiley also publishes its books in a variety of electronic formats. Some content that appears 
in print, however, may not be available in electronic format. 


Library of Congress Cataloging-in-Publication Data 


Anderson, T. W. (Theodore Wilbur), 1918- 
An introduction to multivariate statistical analysis / Theodore W. Anderson.-- 3rd ed. 
p. cm.-- (Wiley series in probability and mathematical statistics) 
Includes bibliographical references and index. 
ISBN 0-471-36091-0 (cloth : acid-free paper) 
1. Multivariate analysis. I. Title. II. Series. 


QA278.A516 2003 
519.5'35--dc21 2002034317 


Printed in the United States of America 


10 9 8 7 6 5 4 3 2 1 


To 
DOROTHY 


Contents 


Preface to the Third Edition 


Preface to the Second Edition 


Preface to the First Edition 


1 Introduction 1

1.1. Multivariate Statistical Analysis, 1
1.2. The Multivariate Normal Distribution, 3


2 The Multivariate Normal Distribution 6

2.1. Introduction, 6
2.2. Notions of Multivariate Distributions, 7
2.3. The Multivariate Normal Distribution, 13
2.4. The Distribution of Linear Combinations of Normally 
Distributed Variates; Independence of Variates; 
Marginal Distributions, 23
2.5. Conditional Distributions and Multiple Correlation 
Coefficient, 33
2.6. The Characteristic Function; Moments, 41
2.7. Elliptically Contoured Distributions, 47
Problems, 56 


3 Estimation of the Mean Vector and the Covariance Matrix 66

3.1. Introduction, 66
3.2. The Maximum Likelihood Estimators of the Mean Vector 
and the Covariance Matrix, 67
3.3. The Distribution of the Sample Mean Vector; Inference 
Concerning the Mean When the Covariance Matrix Is 
Known, 74
3.4. Theoretical Properties of Estimators of the Mean 
Vector, 83
3.5. Improved Estimation of the Mean, 91
3.6. Elliptically Contoured Distributions, 101
Problems, 108


4 The Distributions and Uses of Sample Correlation Coefficients 115

4.1. Introduction, 115
4.2. Correlation Coefficient of a Bivariate Sample, 116
4.3. Partial Correlation Coefficients; Conditional 
Distributions, 136
4.4. The Multiple Correlation Coefficient, 144
4.5. Elliptically Contoured Distributions, 158
Problems, 163


5 The Generalized T²-Statistic 170

5.1. Introduction, 170
5.2. Derivation of the Generalized T²-Statistic and Its 
Distribution, 171
5.3. Uses of the T²-Statistic, 177
5.4. The Distribution of T² under Alternative Hypotheses; 
The Power Function, 185
5.5. The Two-Sample Problem with Unequal Covariance 
Matrices, 187
5.6. Some Optimal Properties of the T²-Test, 190
5.7. Elliptically Contoured Distributions, 199
Problems, 201


6 Classification of Observations 207

6.1. The Problem of Classification, 207

6.2. Standards of Good Classification, 208

6.3. Procedures of Classification into One of Two Populations 
with Known Probability Distributions, 211


6.4. Classification into One of Two Known Multivariate Normal 
Populations, 215 


6.5. Classification into One of Two Multivariate Normal 
Populations When the Parameters Are Estimated, 219 

6.6. Probabilities of Misclassification, 227 

6.7. Classification into One of Several Populations, 233 

6.8. Classification into One of Several Multivariate Normal 
Populations, 237 

6.9. An Example of Classification into One of Several 
Multivariate Normal Populations, 240 

6.10. Classification into One of Two Known Multivariate Normal 
Populations with Unequal Covariance Matrices, 242 

Problems, 248 


7 The Distribution of the Sample Covariance Matrix and the 
Sample Generalized Variance 251 


7.1. Introduction, 251 

7.2. The Wishart Distribution, 252 

7.3. Some Properties of the Wishart Distribution, 258 

7.4. Cochran’s Theorem, 262 

7.5. The Generalized Variance, 264 

7.6. Distribution of the Set of Correlation Coefficients When 
the Population Covariance Matrix Is Diagonal, 270 

7.7. The Inverted Wishart Distribution and Bayes Estimation of 
the Covariance Matrix, 272 

7.8. Improved Estimation of the Covariance Matrix, 276 

7.9. Elliptically Contoured Distributions, 282 
Problems, 285 


8 Testing the General Linear Hypothesis; Multivariate Analysis 
of Variance 291 


8.1. Introduction, 291 

8.2. Estimators of Parameters in Multivariate Linear 
Regression, 292 

8.3. Likelihood Ratio Criteria for Testing Linear Hypotheses 
about Regression Coefficients, 298 


8.4. The Distribution of the Likelihood Ratio Criterion When 
the Hypothesis Is True, 304 

8.5. An Asymptotic Expansion of the Distribution of the 
Likelihood Ratio Criterion, 316 


8.6. Other Criteria for Testing the Linear Hypothesis, 326 

8.7. Tests of Hypotheses about Matrices of Regression 
Coefficients and Confidence Regions, 337 

8.8. Testing Equality of Means of Several Normal Distributions 
with Common Covariance Matrix, 342 

8.9. Multivariate Analysis of Variance, 346 

8.10. Some Optimal Properties of Tests, 353 

8.11. Elliptically Contoured Distributions, 370 

Problems, 374 


9 Testing Independence of Sets of Variates 381

9.1. Introduction, 381
9.2. The Likelihood Ratio Criterion for Testing Independence 
of Sets of Variates, 381
9.3. The Distribution of the Likelihood Ratio Criterion When 
the Null Hypothesis Is True, 386
9.4. An Asymptotic Expansion of the Distribution of the 
Likelihood Ratio Criterion, 390
9.5. Other Criteria, 391
9.6. Step-Down Procedures, 393
9.7. An Example, 396
9.8. The Case of Two Sets of Variates, 397
9.9. Admissibility of the Likelihood Ratio Test, 401
9.10. Monotonicity of Power Functions of Tests of 
Independence of Sets, 402
9.11. Elliptically Contoured Distributions, 404
Problems, 408


10 Testing Hypotheses of Equality of Covariance Matrices and 
Equality of Mean Vectors and Covariance Matrices 411

10.1. Introduction, 411

10.2. Criteria for Testing Equality of Several Covariance 
Matrices, 412

10.3. Criteria for Testing That Several Normal Distributions 
Are Identical, 415

10.4. Distributions of the Criteria, 417

10.5. Asymptotic Expansions of the Distributions of the 
Criteria, 424 

10.6. The Case of Two Populations, 427 

10.7. Testing the Hypothesis That a Covariance Matrix 
Is Proportional to a Given Matrix; The Sphericity 
Test, 431 

10.8. Testing the Hypothesis That a Covariance Matrix Is 
Equal to a Given Matrix, 438 

10.9. Testing the Hypothesis That a Mean Vector and a 
Covariance Matrix Are Equal to a Given Vector and 
Matrix, 444 

10.10. Admissibility of Tests, 446 

10.11. Elliptically Contoured Distributions, 449 

Problems, 454 

11 Principal Components 459 

11.1. Introduction, 459 

11.2. Definition of Principal Components in the 
Population, 460 

11.3. Maximum Likelihood Estimators of the Principal 
Components and Their Variances, 467 

11.4. Computation of the Maximum Likelihood Estimates of 
the Principal Components, 469 

11.5. An Example, 471 

11.6. Statistical Inference, 473 

11.7. Testing Hypotheses about the Characteristic Roots of a 
Covariance Matrix, 478 

11.8. Elliptically Contoured Distributions, 482 

Problems, 483 

12 Canonical Correlations and Canonical Variables 487 

12.1. Introduction, 487 

12.2. Canonical Correlations and Variates in the 
Population, 488 

12.3. Estimation of Canonical Correlations and Variates, 498 

12.4. Statistical Inference, 503 

12.5. An Example, 505 

12.6. Linearly Related Expected Values, 508 

12.7. Reduced Rank Regression, 514 

12.8. Simultaneous Equations Models, 515 

Problems, 526 


13 The Distributions of Characteristic Roots and Vectors 528

13.1. Introduction, 528
13.2. The Case of Two Wishart Matrices, 529
13.3. The Case of One Nonsingular Wishart Matrix, 538
13.4. Canonical Correlations, 543
13.5. Asymptotic Distributions in the Case of One Wishart 
Matrix, 545
13.6. Asymptotic Distributions in the Case of Two Wishart 
Matrices, 549
13.7. Asymptotic Distribution in a Regression Model, 555
13.8. Elliptically Contoured Distributions, 563
Problems, 567


14 Factor Analysis 569

14.1. Introduction, 569
14.2. The Model, 570
14.3. Maximum Likelihood Estimators for Random 
Orthogonal Factors, 576
14.4. Estimation for Fixed Factors, 586
14.5. Factor Interpretation and Transformation, 587
14.6. Estimation for Identification by Specified Zeros, 590
14.7. Estimation of Factor Scores, 591
Problems, 593


15 Patterns of Dependence; Graphical Models 595

15.1. Introduction, 595
15.2. Undirected Graphs, 596
15.3. Directed Graphs, 604
15.4. Chain Graphs, 610
15.5. Statistical Inference, 613


Appendix A Matrix Theory 624

A.1. Definition of a Matrix and Operations on Matrices, 624
A.2. Characteristic Roots and Vectors, 631
A.3. Partitioned Vectors and Matrices, 635 
A.4. Some Miscellaneous Results, 639 
A.5. Gram-Schmidt Orthogonalization and the Solution of 
Linear Equations, 647 


Appendix B Tables 651 

B.1. Wilks’ Likelihood Criterion: Factors C(p, m, M) to 
Adjust to χ²_{p,m}, where M = n − p + 1, 651 
B.2. Tables of Significance Points for the Lawley-Hotelling 
Trace Test, 657 
B.3. Tables of Significance Points for the 
Bartlett-Nanda-Pillai Trace Test, 673 
B.4. Tables of Significance Points for the Roy Maximum Root 
Test, 677 
B.5. Significance Points for the Modified Likelihood Ratio 
Test of Equality of Covariance Matrices Based on Equal 
Sample Sizes, 681 
B.6. Correction Factors for Significance Points for the 
Sphericity Test, 683 
B.7. Significance Points for the Modified Likelihood Ratio 
Test Σ = Σ₀, 685 


References 687 

Index 713 


Preface to the Third Edition 


For some forty years the first and second editions of this book have been 
used by students to acquire a basic knowledge of the theory and methods of 
multivariate statistical analysis. The book has also served a wider community 
of statisticians in furthering their understanding and proficiency in this field. 
Since the second edition was published, multivariate analysis has been 
developed and extended in many directions. Rather than attempting to cover, 
or even survey, the enlarged scope, I have elected to elucidate several aspects 
that are particularly interesting and useful for methodology and comprehen- 
sion. 

Earlier editions included some methods that could be carried out on an 
adding machine! In the twenty-first century, however, computational tech- 
niques have become so highly developed and improvements come so rapidly 
that it is impossible to include all of the relevant methods in a volume on the 
general mathematical theory. Some aspects of statistics exploit computational 
power such as the resampling technologies; these are not covered here. 

The definition of multivariate statistics implies the treatment of variables 
that are interrelated. Several chapters are devoted to measures of correlation 
and tests of independence. A new chapter, “Patterns of Dependence; 
Graphical Models,” has been added. A so-called graphical model is a set of vertices 
or nodes identifying observed variables together with a new set of edges 
suggesting dependences between variables. The algebra of such graphs is an 
outgrowth and development of path analysis and the study of causal chains. 
A graph may represent a sequence in time or logic and may suggest causation 
of one set of variables by another set. 

Another new topic systematically presented in the third edition is that of 
elliptically contoured distributions. The multivariate normal distribution, 
which is characterized by the mean vector and covariance matrix, has a 
limitation that the fourth-order moments of the variables are determined by 
the first- and second-order moments. The class of elliptically contoured 
distributions relaxes this restriction. A density in this class has contours of 
equal density which are ellipsoids as does a normal density, but the set of 
fourth-order moments has one further degree of freedom. This topic is 
expounded by the addition of sections to appropriate chapters. 

Reduced rank regression developed in Chapters 12 and 13 provides a 
method of reducing the number of regression coefficients to be estimated in 
the regression of one set of variables on another. This approach includes the 
limited-information maximum-likelihood estimator of an equation in a simul- 
taneous equations model. 

The preparation of the third edition has been benefited by advice and 
comments of readers of the first and second editions as well as by reviewers 
of the current revision. In addition to readers of the earlier editions listed in 
those prefaces I want to thank Michael Perlman and Kathy Richards for their 
assistance in getting this manuscript ready. 


T. W. ANDERSON 


Stanford, California 
February 2003 


Preface to the Second Edition 


Twenty-six years have passed since the first edition of this book was pub- 
lished. During that time great advances have been made in multivariate 
statistical analysis—particularly in the areas treated in that volume. This new 
edition purports to bring the original edition up to date by substantial 
revision, rewriting, and additions. The basic approach has been maintained, 
namely, a mathematically rigorous development of statistical methods for 
observations consisting of several measurements or characteristics of each 
subject and a study of their properties. The general outline of topics has been 
retained. 

The method of maximum likelihood has been augmented by other consid- 
erations. In point estimation of the mean vector and covariance matrix 
alternatives to the maximum likelihood estimators that are better with 
respect to certain loss functions, such as Stein and Bayes estimators, have 
been introduced. In testing hypotheses likelihood ratio tests have been 
supplemented by other invariant procedures. New results on distributions 
and asymptotic distributions are given; some significant points are tabulated. 
Properties of these procedures, such as power functions, admissibility, unbi- 
asedness, and monotonicity of power functions, are studied. Simultaneous 
confidence intervals for means and covariances are developed. A chapter on 
factor analysis replaces the chapter sketching miscellaneous results in the 
first edition. Some new topics, including simultaneous equations models and 
linear functional relationships, are introduced. Additional problems present 
further results. 

It is impossible to cover all relevant material in this book; what seems 
most important has been included. For a comprehensive listing of papers 
until 1966 and books until 1970 the reader is referred to A Bibliography of 
Multivariate Statistical Analysis by Anderson, Das Gupta, and Styan (1972). 
Further references can be found in Multivariate Analysis: A Selected and 
Abstracted Bibliography, 1957-1972 by Subrahmaniam and Subrahmaniam 
(1973). 

I am in debt to many students, colleagues, and friends for their suggestions 
and assistance; they include Yasuo Amemiya, James Berger, Byoung-Seon 
Choi, Arthur Cohen, Margery Cruise, Somesh Das Gupta, Kai-Tai Fang, 
Gene Golub, Aaron Han, Takeshi Hayakawa, Jogi Henna, Huang Hsu, Fred 
Huffer, Mituaki Huzii, Jack Kiefer, Mark Knowles, Sue Leurgans, Alex 
McMillan, Masashi No, Ingram Olkin, Kartik Patel, Michael Perlman, Allen 
Sampson, Ashis Sen Gupta, Andrew Siegel, Charles Stein, Patrick Strout, 
Akimichi Takemura, Joe Verducci, Marlos Viana, and Y. Yajima. I was 
helped in preparing the manuscript by Dorothy Anderson, Alice Lundin, 
Amy Schwartz, and Pat Struse. Special thanks go to Johanne Thiffault and 
George P. H. Styan for their precise attention. Support was contributed by 
the Army Research Office, the National Science Foundation, the Office of 
Naval Research, and IBM Systems Research Institute. 

Seven tables of significance points are given in Appendix B to facilitate 
carrying out test procedures. Tables 1, 5, and 7 are Tables 47, 50, and 53, 
respectively, of Biometrika Tables for Statisticians, Vol. 2, by E. S. Pearson 
and H. O. Hartley; permission of the Biometrika Trustees is hereby acknowl- 
edged. Table 2 is made up from three tables prepared by A. W. Davis and 
published in Biometrika (1970a), Annals of the Institute of Statistical Mathe- 
matics (1970b) and Communications in Statistics, B. Simulation and Computa- 
tion (1980). Tables 3 and 4 are Tables 6.3 and 6.4, respectively, of Concise 
Statistical Tables, edited by Ziro Yamauti (1977) and published by the 
Japanese Standards Association; this book is a concise version of Statistical 
Tables and Formulas with Computer Applications, JSA-1972. Table 6 is Table 3 
of The Distribution of the Sphericity Test Criterion, ARL 72-0154, by B. N. 
Nagarsenker and K. C. S. Pillai, Aerospace Research Laboratories (1972). 
The author is indebted to the authors and publishers listed above for 
permission to reproduce these tables. 


T. W. ANDERSON 


Stanford, California 
June 1984 


Preface to the First Edition 


This book has been designed primarily as a text for a two-semester course in 
multivariate statistics. It is hoped that the book will also serve as an 
introduction to many topics in this area to statisticians who are not students 
and will be used as a reference by other statisticians. 

For several years the book in the form of dittoed notes has been used in a 
two-semester sequence of graduate courses at Columbia University; the first 
six chapters constituted the text for the first semester, emphasizing correla- 
tion theory. It is assumed that the reader is familiar with the usual theory of 
univariate statistics, particularly methods based on the univariate normal 
distribution. A knowledge of matrix algebra is also a prerequisite; however, 
an appendix on this topic has been included. 

It is hoped that the more basic and important topics are treated here, 
though to some extent the coverage is a matter of taste. Some of the more 
recent and advanced developments are only briefly touched on in the last 
chapter. 

The method of maximum likelihood is used to a large extent. This leads to 
reasonable procedures; in some cases it can be proved that they are optimal. 
In many situations, however, the theory of desirable or optimum procedures 
is lacking. 

Over the years this manuscript has been developed, a number of students 
and colleagues have been of considerable assistance. Allan Birnbaum, Harold 
Hotelling, Jacob Horowitz, Howard Levene, Ingram Olkin, Gobind Seth, 
Charles Stein, and Henry Teicher are to be mentioned particularly. 
Acknowledgements are also due to other members of the Graduate Mathematical 
Statistics Society at Columbia University for aid in the preparation of the 
manuscript in dittoed form. The preparation of this manuscript was sup- 
ported in part by the Office of Naval Research. 


T. W. ANDERSON 


Center for Advanced Study 

in the Behavioral Sciences 
Stanford, California 
December 1957 


CHAPTER 1 


Introduction 


1.1. MULTIVARIATE STATISTICAL ANALYSIS 


Multivariate statistical analysis is concerned with data that consist of sets of 
measurements on a number of individuals or objects. The sample data may 
be heights and weights of some individuals drawn randomly from a popula- 
tion of school children in a given city, or the statistical treatment may be 
made on a collection of measurements, such as lengths and widths of petals 
and lengths and widths of sepals of iris plants taken from two species, or one 
may study the scores on batteries of mental tests administered to a number of 
students. 

The measurements made on a single individual can be assembled into a 
column vector. We think of the entire vector as an observation from a 
multivariate population or distribution. When the individual is drawn ran- 
domly, we consider the vector as a random vector with a distribution or 
probability law describing that population. The set of observations on all 
individuals in a sample constitutes a sample of vectors, and the vectors set 
side by side make up the matrix of observations.† The data to be analyzed 
then are thought of as displayed in a matrix or in several matrices. 

We shall see that it is helpful in visualizing the data and understanding the 
methods to think of each observation vector as constituting a point in a 
Euclidean space, each coordinate corresponding to a measurement or variable. 
Indeed, an early step in the statistical analysis is plotting the data; since 
most statisticians are limited to two-dimensional plots, two coordinates of the 
observation are plotted in turn. 

†When data are listed on paper by individual, it is natural to print the measurements on one 
individual as a row of the table; then one individual corresponds to a row vector. Since we prefer 
to operate algebraically with column vectors, we have chosen to treat observations in terms of 
column vectors. (In practice, the basic data set may well be on cards, tapes, or disks.) 

An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson 
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc. 

Characteristics of a univariate distribution of essential interest are the 
mean as a measure of location and the standard deviation as a measure of 
variability; similarly the mean and standard deviation of a univariate sample 
are important summary measures. In multivariate analysis, the means and 
variances of the separate measurements—for distributions and for samples 
—have corresponding relevance. An essential aspect, however, of multivari- 
ate analysis is the dependence between the different variables. The depen- 
dence between two variables may involve the covariance between them, that 
is, the average products of their deviations from their respective means. The 
covariance standardized by the corresponding standard deviations is the 
correlation coefficient; it serves as a measure of degree of dependence. A set 
of summary statistics is the mean vector (consisting of the univariate means) 
and the covariance matrix (consisting of the univariate variances and 
bivariate covariances). An alternative set of summary statistics with the same 
information is the mean vector, the set of standard deviations, and the 
correlation matrix. Similar parameter quantities describe location, variability, 
and dependence in the population or for a probability distribution. The 
multivariate normal distribution is completely determined by its mean vector 
and covariance matrix, and the sample mean vector and covariance matrix 
constitute a sufficient set of statistics. 
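
As a rough illustration of these summary statistics, the following sketch computes a mean vector, a covariance matrix, and a correlation matrix for a small made-up sample, and then rebuilds the covariance matrix from the standard deviations and correlations. The data values and function names are hypothetical, and the divisor n - 1 in the covariances is an assumed convention; the text above speaks only of average products of deviations.

```python
import math

def mean_vector(rows):
    """Univariate means; rows[i] holds the measurements on individual i."""
    n, p = len(rows), len(rows[0])
    return [sum(row[j] for row in rows) / n for j in range(p)]

def covariance_matrix(rows):
    """p x p matrix of sample variances and covariances (assumed divisor n - 1)."""
    n, p = len(rows), len(rows[0])
    m = mean_vector(rows)
    return [[sum((row[j] - m[j]) * (row[k] - m[k]) for row in rows) / (n - 1)
             for k in range(p)]
            for j in range(p)]

def correlation_matrix(cov):
    """Covariances standardized by the corresponding standard deviations."""
    p = len(cov)
    sd = [math.sqrt(cov[j][j]) for j in range(p)]
    return [[cov[j][k] / (sd[j] * sd[k]) for k in range(p)] for j in range(p)]

# Hypothetical sample: three individuals, two measurements each.
sample = [[1.0, 2.0], [2.0, 1.0], [3.0, 6.0]]

mean = mean_vector(sample)
cov = covariance_matrix(sample)
corr = correlation_matrix(cov)

# The standard deviations and the correlation matrix carry the same
# information as the covariance matrix: the latter can be rebuilt from them.
sd = [math.sqrt(cov[j][j]) for j in range(len(cov))]
rebuilt = [[sd[j] * sd[k] * corr[j][k] for k in range(len(cov))]
           for j in range(len(cov))]
```

The rebuilt matrix agrees with the covariance matrix up to rounding, which is the sense in which the two sets of summary statistics are equivalent.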

The measurement and analysis of dependence between variables, between 
sets of variables, and between variables and sets of variables are fundamental 
to multivariate analysis. The multiple correlation coefficient is an extension 
of the notion of correlation to the relationship of one variable to a set of 
variables. The partial correlation coefficient is a measure of dependence 
between two variables when the effects of other correlated variables have 
been removed. The various correlation coefficients computed from samples 
are used to estimate corresponding correlation coefficients of distributions. 
In this book tests of hypotheses of independence are developed. The proper- 
ties of the estimators and test procedures are studied for sampling from the 
multivariate normal distribution. 
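
For three variables the partial correlation can be written in terms of the ordinary pairwise correlations. The sketch below assumes the standard first-order formula r12.3 = (r12 - r13 r23) / sqrt((1 - r13^2)(1 - r23^2)); the function name and the numerical values are hypothetical and chosen only for illustration.

```python
import math

def partial_correlation(r12, r13, r23):
    """Correlation between variables 1 and 2 after removing the linear
    effect of variable 3, computed from the three pairwise correlations."""
    denom = math.sqrt((1.0 - r13 ** 2) * (1.0 - r23 ** 2))
    return (r12 - r13 * r23) / denom

# Variables 1 and 2 are correlated only through variable 3:
# the partial correlation is zero up to rounding.
explained = partial_correlation(0.6, 0.8, 0.75)

# Variable 3 is uncorrelated with both: the correlation is unchanged.
unchanged = partial_correlation(0.5, 0.0, 0.0)
```

In the first call the observed correlation of 0.6 equals the product 0.8 x 0.75, so nothing remains once the third variable is held fixed.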

A number of statistical problems arising in multivariate populations are 
straightforward analogs of problems arising in univariate populations; the 
suitable methods for handling these problems are similarly related. For 
example, in the univariate case we may wish to test the hypothesis that the 
mean of a variable is zero; in the multivariate case we may wish to test the 
hypothesis that the vector of the means of several variables is the zero vector. 
The analog of the Student t-test for the first hypothesis is the generalized 
T²-test. The analysis of variance of a single variable is adapted to vector 
observations; in regression analysis, the dependent quantity may be a vector 
variable. A comparison of variances is generalized into a comparison of 
covariance matrices. 

The test procedures of univariate statistics are generalized to the multi- 
variate case in such ways that the dependence between variables is taken into 
account. These methods may not depend on the coordinate system; that is, 
the procedures may be invariant with respect to linear transformations that 
leave the null hypothesis invariant. In some problems there may be families 
of tests that are invariant; then choices must be made. Optimal properties of 
the tests are considered. 
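
The invariance property can be checked numerically for the hypothesis that the mean vector is zero. The sketch below assumes the usual form of the statistic, T² = N (x̄ - μ0)' S⁻¹ (x̄ - μ0) with S the sample covariance matrix (divisor N - 1), restricted to two variables so that the inverse can be written out explicitly. The data, the function name, and the particular linear transformation are hypothetical.

```python
def t2_statistic(rows, mu0):
    """T^2 = N (xbar - mu0)' S^{-1} (xbar - mu0) for two variables."""
    n = len(rows)
    xbar = [sum(r[j] for r in rows) / n for j in range(2)]
    # Sample covariance matrix S (assumed divisor n - 1).
    s = [[sum((r[j] - xbar[j]) * (r[k] - xbar[k]) for r in rows) / (n - 1)
          for k in range(2)]
         for j in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    d = [xbar[0] - mu0[0], xbar[1] - mu0[1]]
    # Quadratic form d' S^{-1} d via the closed-form 2 x 2 inverse.
    quad = (s[1][1] * d[0] ** 2
            - 2.0 * s[0][1] * d[0] * d[1]
            + s[0][0] * d[1] ** 2) / det
    return n * quad

# Hypothetical sample of four bivariate observations.
rows = [[1.0, 2.0], [3.0, 2.0], [1.0, 4.0], [3.0, 4.0]]
t2 = t2_statistic(rows, [0.0, 0.0])

# A nonsingular linear transformation (u, v) = (2x + y, x - y) leaves the
# hypothesis "mean vector = 0" unchanged, and T^2 takes the same value.
transformed = [[2.0 * x + y, x - y] for x, y in rows]
t2_transformed = t2_statistic(transformed, [0.0, 0.0])
```

Here the two sample variances are 4/3, the sample covariance is zero, and the mean vector is (2, 3), so T² = 4 (4 + 9) / (4/3) = 39 in both coordinate systems.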

For some other purposes, however, it may be important to select a 
coordinate system so that the variates have desired statistical properties. One 
might say that they involve characterizations of inherent properties of normal 
distributions and of samples. These are closely related to the algebraic 
problems of canonical forms of matrices. An example is finding the normal- 
ized linear combination of variables with maximum or minimum variance 
(finding principal components); this amounts to finding a rotation of axes 
that carries the covariance matrix to diagonal form. Another example is 
characterizing the dependence between two sets of variates (finding canonical 
correlations). These problems involve the characteristic roots and vectors 
of various matrices. The statistical properties of the corresponding sample 
quantities are treated. 
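
For two variables the rotation that carries the covariance matrix to diagonal form can be written in closed form. The sketch below is a minimal illustration under that restriction: for a covariance matrix with variances a and c and covariance b, the angle θ satisfying tan 2θ = 2b / (a - c) gives uncorrelated rotated coordinates, the first of which has maximum variance. The function name and the numbers are hypothetical.

```python
import math

def principal_axes_2x2(a, b, c):
    """For the covariance matrix [[a, b], [b, c]], return the rotation angle
    and the variances and covariance of the rotated coordinates
    u = cos(t) x + sin(t) y and v = -sin(t) x + cos(t) y."""
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cs, sn = math.cos(theta), math.sin(theta)
    var_u = a * cs * cs + 2.0 * b * cs * sn + c * sn * sn
    var_v = a * sn * sn - 2.0 * b * cs * sn + c * cs * cs
    cov_uv = (c - a) * cs * sn + b * (cs * cs - sn * sn)
    return theta, var_u, var_v, cov_uv

# Equal variances 2 and covariance 1: the axes are rotated by 45 degrees.
theta, var_u, var_v, cov_uv = principal_axes_2x2(2.0, 1.0, 2.0)
```

In this example the rotated variances are 3 and 1, which are the characteristic roots of the matrix, and the rotated covariance vanishes. The total variance 2 + 2 = 3 + 1 is unchanged by the rotation.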

Some statistical problems arise in models in which means and covariances 
are restricted. Factor analysis may be based on a model with a (population) 
covariance matrix that is the sum of a positive definite diagonal matrix and a 
positive semidefinite matrix of low rank; linear structural relationships may 
have a similar formulation. The simultaneous equations system of economet- 
rics is another example of a special model. 


1.2. THE MULTIVARIATE NORMAL DISTRIBUTION 


The statistical methods treated in this book can be developed and evaluated 
in the context of the multivariate normal distribution, though many of the 
procedures are useful and effective when the distribution sampled is not 
normal. A major reason for basing statistical analysis on the normal distribu- 
tion is that this probabilistic model approximates well the distribution of 
continuous measurements in many sampled populations. In fact, most of the 
methods and theory have been developed to serve statistical analysis of data. 
Mathematicians such as Adrian (1808), Laplace (1811), Plana (1813), Gauss 
(1823), and Bravais (1846) studied the bivariate normal density. Francis 
Galton, the geneticist, introduced the ideas of correlation, regression, and 
homoscedasticity in the study of pairs of measurements, one made on a 
parent and one in an offspring. [See, e.g., Galton (1889).] He enunciated the 
theory of the multivariate normal distribution as a generalization of observed 
properties of samples. 

Karl Pearson and others carried on the development of the theory and use 
of different kinds of correlation coefficients† for studying problems in 
genetics, biology, and other fields. R. A. Fisher further developed methods for 
agriculture, botany, and anthropology, including the discriminant function for 
classification problems. In another direction, analysis of scores on mental 
tests led to a theory, including factor analysis, the sampling theory of which is 
based on the normal distribution. In these cases, as well as in agricultural 
experiments, in engineering problems, in certain economic problems, and in 
other fields, the multivariate normal distributions have been found to be 
sufficiently close approximations to the populations so that statistical analy- 
ses based on these models are justified. 

The univariate normal distribution arises frequently because the effect 
studied is the sum of many independent random effects. Similarly, the 
multivariate normal distribution often occurs because the multiple measure- 
ments are sums of small independent effects. Just as the central limit 
theorem leads to the univariate normal distribution for single variables, so 
does the general central limit theorem for several variables lead to the 
multivariate normal distribution. 

Statistical theory based on the normal distribution has the advantage that 
the multivariate methods based on it are extensively developed and can be 
studied in an organized and systematic way. This is due not only to the need 
for such methods because they are of practical use, but also to the fact that 
normal theory is amenable to exact mathematical treatment. The suitable 
methods of analysis are mainly based on standard operations of matrix 
algebra; the distributions of many statistics involved can be obtained exactly 
or at least characterized; and in many cases optimum properties of proce- 
dures can be deduced. 

The point of view in this book is to state problems of inference in terms of 
the multivariate normal distributions, develop efficient and often optimum 
methods in this context, and evaluate significance and confidence levels in 
these terms. This approach gives coherence and rigor to the exposition, but, 
by its very nature, cannot exhaust consideration of multivariate stztistical 
analysis. The procedures are appropriate to many nonnormal distributions, 


†For a detailed study of the development of the ideas of correlation, see Walker (1931). 




but their adequacy may be open to question. Roughly speaking, inferences 
about means are robust because of the operation of the central limit 
theorem, but inferences about covariances are sensitive to normality, the 
variability of sample covariances depending on fourth-order moments. 

This inflexibility of normal methods with respect to moments of order 
greater than two can be reduced by including a larger class of elliptically 
contoured distributions. In the univariate case the normal distribution is 
determined by the mean and variance; higher-order moments and properties 
such as peakedness and long tails are functions of the mean and variance. 
Similarly, in the multivariate case the means and covariances or the means, 
variances, and correlations determine all of the properties of the distribution. 
That limitation is alleviated in one respect by consideration of a broad class 
of elliptically contoured distributions. That class maintains the dependence 
structure, but permits more general peakedness and long tails. This study 
leads to more robust methods. 

The development of computer technology has revolutionized multivariate 
statistics in several respects. As in univariate statistics, modern computers 
permit the evaluation of observed variability and significance of results by 
resampling methods, such as the bootstrap and cross-validation. Such 
methodology reduces the reliance on tables of significance points as well as 
eliminates some restrictions of the normal distribution. 

Nonparametric techniques are available when nothing is known about the 
underlying distributions. Space does not permit inclusion of these topics as 
well as other considerations of data analysis, such as treatment of outliers 
and transformations of variables to approximate normality and homoscedas- 
ticity. 

The availability of modern computer facilities makes possible the analysis 
of large data sets and that ability permits the application of multivariate 
methods to new areas, such as image analysis, and more effective analysis of 
data, such as meteorological. Moreover, new problems of statistical analysis 
arise, such as sparseness of parameter or data matrices. Because hardware 
and software development is so explosive and programs require specialized 
knowledge, we are content to make a few remarks here and there about 
computation. Packages of statistical programs are available for most of the 
methods. 


CHAPTER 2 


The Multivariate 
Normal Distribution 


2.1. INTRODUCTION 


In this chapter we discuss the multivariate normal distribution and some of 
its properties. In Section 2.2 are considered the fundamental notions of 
multivariate distributions: the definition by means of multivariate density 
functions, marginal distributions, conditional distributions, expected values, 
and moments. In Section 2.3 the multivariate normal distribution is defined; 
the parameters are shown to be the means, variances, and covariances or the 
means, variances, and correlations of the components of the random vector. 
In Section 2.4 it is shown that linear combinations of normal variables are 
normally distributed and hence that marginal distributions are normal. In 
Section 2.5 we see that conditional distributions are also normal with means 
that are linear functions of the conditioning variables; the coefficients are 
regression coefficients. The variances, covariances, and correlations (called 
partial correlations) are constants. The multiple correlation coefficient is 
the maximum correlation between a scalar random variable and linear 
combination of other random variables; it is a measure of association be- 
tween one variable and a set of others. The fact that marginal and condi- 
tional distributions of normal distributions are normal makes the treatment 
of this family of distributions coherent. In Section 2.6 the characteristic 
function, moments, and cumulants are discussed. In Section 2.7 elliptically 
contoured distributions are defined; the properties of the normal distribution 
are extended to this larger class of distributions. 


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson 
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc. 




2.2. NOTIONS OF MULTIVARIATE DISTRIBUTIONS 


2.2.1. Joint Distributions 


In this section we shall consider the notions of joint distributions of several 
variables, derived marginal distributions of subsets of variables, and derived 
conditional distributions. First consider the case of two (real) random 
variables† X and Y. Probabilities of events defined in terms of these variables 
can be obtained by operations involving the cumulative distribution function 
(abbreviated as cdf), 


(1)    $$F(x, y) = \Pr\{X \le x,\ Y \le y\},$$


defined for every pair of real numbers (x, y). We are interested in cases 
where F(x, y) is absolutely continuous; this means that the following partial 
derivative exists almost everywhere: 


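(2)    $$\frac{\partial^2 F(x, y)}{\partial x\, \partial y} = f(x, y),$$

and

(3)    $$F(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f(u, v)\, du\, dv.$$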


The nonnegative function f(x, y) is called the density of X and Y. The pair 
of random variables (X, Y) defines a random point in a plane. The probability 
that (X, Y) falls in a rectangle is

(4)    $$\begin{aligned} \Pr\{x \le X \le x + \Delta x,\ y \le Y \le y + \Delta y\} &= F(x + \Delta x, y + \Delta y) - F(x + \Delta x, y) - F(x, y + \Delta y) + F(x, y) \\ &= \int_{y}^{y + \Delta y} \int_{x}^{x + \Delta x} f(u, v)\, du\, dv \end{aligned}$$

($\Delta x > 0$, $\Delta y > 0$). The probability of the random point (X, Y) falling in any 
set E for which the following integral is defined (that is, any measurable set 
E) is

(5)    $$\Pr\{(X, Y) \in E\} = \iint_{E} f(x, y)\, dx\, dy.$$


†In Chapter 2 we shall distinguish between random variables and running variables by use of 
capital and lowercase letters, respectively. In later chapters we may be unable to hold to this 
convention because of other complications of notation. 




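This follows from the definition of the integral [as the limit of sums of the 
sort (4)]. If f(x, y) is continuous in both variables, the probability element 
$f(x, y)\, \Delta y\, \Delta x$ is approximately the probability that X falls between x and 
$x + \Delta x$ and Y falls between y and $y + \Delta y$ since

(6)    $$\Pr\{x \le X \le x + \Delta x,\ y \le Y \le y + \Delta y\} = \int_{y}^{y + \Delta y} \int_{x}^{x + \Delta x} f(u, v)\, du\, dv = f(x_0, y_0)\, \Delta x\, \Delta y$$

for some $x_0, y_0$ ($x \le x_0 \le x + \Delta x$, $y \le y_0 \le y + \Delta y$) by the mean value theorem 
of calculus. Since f(u, v) is continuous, (6) is approximately $f(x, y)\, \Delta x\, \Delta y$. 
In fact,

(7)    $$\lim_{\substack{\Delta x \to 0 \\ \Delta y \to 0}} \frac{1}{\Delta x\, \Delta y} \left| \Pr\{x \le X \le x + \Delta x,\ y \le Y \le y + \Delta y\} - f(x, y)\, \Delta x\, \Delta y \right| = 0.$$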
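
As a rough numerical illustration of (4) and of the probability element, the sketch below uses a toy density that is not from the text: f(x, y) = x + y on the unit square, whose cdf is F(x, y) = (x²y + xy²)/2 there. The function names and the choice of rectangle are assumptions made only for this example.

```python
def F(x, y):
    """cdf of the assumed toy density f(x, y) = x + y on the unit square."""
    return 0.5 * (x * x * y + x * y * y)


def f(x, y):
    """Assumed toy density on 0 <= x, y <= 1 (hypothetical example)."""
    return x + y


def rect_prob_cdf(x, y, dx, dy):
    # Equation (4): combine the cdf at the four corners of the rectangle.
    return F(x + dx, y + dy) - F(x + dx, y) - F(x, y + dy) + F(x, y)


def rect_prob_integral(x, y, dx, dy, n=200):
    # Midpoint-rule approximation of the double integral of f over the rectangle.
    hx, hy = dx / n, dy / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += f(x + (i + 0.5) * hx, y + (j + 0.5) * hy)
    return total * hx * hy


p_cdf = rect_prob_cdf(0.2, 0.3, 0.1, 0.2)       # via cdf differences
p_int = rect_prob_integral(0.2, 0.3, 0.1, 0.2)  # via the density
p_elem = f(0.2, 0.3) * 0.1 * 0.2                # probability element f(x, y) dx dy
```

The two rectangle probabilities agree up to rounding, while the probability element is only an approximation that improves as the rectangle shrinks, in line with (7).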


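Now we consider the case of p random variables $X_1, X_2, \ldots, X_p$. The 
cdf is

(8)    $$F(x_1, \ldots, x_p) = \Pr\{X_1 \le x_1, \ldots, X_p \le x_p\},$$

defined for every set of real numbers $x_1, \ldots, x_p$. The density function, if 
$F(x_1, \ldots, x_p)$ is absolutely continuous, is

(9)    $$\frac{\partial^p F(x_1, \ldots, x_p)}{\partial x_1 \cdots \partial x_p} = f(x_1, \ldots, x_p)$$

(almost everywhere), and

(10)    $$F(x_1, \ldots, x_p) = \int_{-\infty}^{x_p} \cdots \int_{-\infty}^{x_1} f(u_1, \ldots, u_p)\, du_1 \cdots du_p.$$

The probability of falling in any (measurable) set R in the p-dimensional 
Euclidean space is

(11)    $$\Pr\{(X_1, \ldots, X_p) \in R\} = \int \cdots \int_{R} f(x_1, \ldots, x_p)\, dx_1 \cdots dx_p.$$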


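The probability element $f(x_1, \ldots, x_p)\, \Delta x_1 \cdots \Delta x_p$ is approximately the 
probability $\Pr\{x_1 \le X_1 \le x_1 + \Delta x_1, \ldots, x_p \le X_p \le x_p + \Delta x_p\}$ if 
$f(x_1, \ldots, x_p)$ is continuous. The joint moments are defined as†

(12)    $$\mathscr{E} X_1^{h_1} \cdots X_p^{h_p} = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_1^{h_1} \cdots x_p^{h_p} f(x_1, \ldots, x_p)\, dx_1 \cdots dx_p.$$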


2.2.2. Marginal Distributions 


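Given the cdf of two random variables X, Y as being F(x, y), the marginal 
cdf of X is

(13)    $$\Pr\{X \le x\} = \Pr\{X \le x,\ Y \le \infty\} = F(x, \infty).$$

Let this be F(x). Clearly

(14)    $$F(x) = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f(u, v)\, dv\, du.$$

We call

(15)    $$\int_{-\infty}^{\infty} f(u, v)\, dv = f(u),$$

say, the marginal density of X. Then (14) is

(16)    $$F(x) = \int_{-\infty}^{x} f(u)\, du.$$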


In a similar fashion we define G(y), the marginal cdf of Y, and g(y), the 
marginal density of Y. 

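Now we turn to the general case. Given $F(x_1, \ldots, x_p)$ as the cdf of 
$X_1, \ldots, X_p$, we wish to find the marginal cdf of some of $X_1, \ldots, X_p$, say, of 
$X_1, \ldots, X_r$ ($r < p$). It is

(17)    $$\begin{aligned} \Pr\{X_1 \le x_1, \ldots, X_r \le x_r\} &= \Pr\{X_1 \le x_1, \ldots, X_r \le x_r,\ X_{r+1} \le \infty, \ldots, X_p \le \infty\} \\ &= F(x_1, \ldots, x_r, \infty, \ldots, \infty). \end{aligned}$$

The marginal density of $X_1, \ldots, X_r$ is

(18)    $$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_r, u_{r+1}, \ldots, u_p)\, du_{r+1} \cdots du_p.$$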


†$\mathscr{E}$ will be used to denote mathematical expectation. 




The marginal distribution and density of any other subset of $X_1, \ldots, X_p$ are 
obtained in the obviously similar fashion. 

The joint moments of a subset of variates can be computed from the 
marginal distribution; for example, 


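(19)    $$\begin{aligned} \mathscr{E} X_1^{h_1} \cdots X_r^{h_r} &= \mathscr{E} X_1^{h_1} \cdots X_r^{h_r} X_{r+1}^{0} \cdots X_p^{0} \\ &= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_1^{h_1} \cdots x_r^{h_r} f(x_1, \ldots, x_p)\, dx_1 \cdots dx_p \\ &= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_1^{h_1} \cdots x_r^{h_r} \left[ \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_p)\, dx_{r+1} \cdots dx_p \right] dx_1 \cdots dx_r. \end{aligned}$$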


2.2.3. Statistical Independence 
Two random variables X, Y with cdf F(x, y) are said to be independent if 
(20) F(x,y) =F(x)G(y), 


where F(x) is the marginal cdf of X and G(y) is the marginal cdf of Y. This 
implies that the density of X, Y is 


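(21)    $$\begin{aligned} f(x, y) &= \frac{\partial^2 F(x, y)}{\partial x\, \partial y} = \frac{\partial^2 F(x) G(y)}{\partial x\, \partial y} \\ &= \frac{dF(x)}{dx} \frac{dG(y)}{dy} \\ &= f(x) g(y). \end{aligned}$$

Conversely, if f(x, y) = f(x)g(y), then

(22)    $$\begin{aligned} F(x, y) &= \int_{-\infty}^{y} \int_{-\infty}^{x} f(u, v)\, du\, dv = \int_{-\infty}^{y} \int_{-\infty}^{x} f(u) g(v)\, du\, dv \\ &= \int_{-\infty}^{x} f(u)\, du \int_{-\infty}^{y} g(v)\, dv = F(x) G(y). \end{aligned}$$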


Thus an equivalent definition of independence in the case of densities 
existing is that f(x, y)=f(x)g(y). To see the implications of statistical 
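independence, given any $x_1 < x_2$, $y_1 < y_2$, we consider the probability

(23)    $$\begin{aligned} \Pr\{x_1 \le X \le x_2,\ y_1 \le Y \le y_2\} &= \int_{y_1}^{y_2} \int_{x_1}^{x_2} f(u, v)\, du\, dv = \int_{x_1}^{x_2} f(u)\, du \int_{y_1}^{y_2} g(v)\, dv \\ &= \Pr\{x_1 \le X \le x_2\} \Pr\{y_1 \le Y \le y_2\}. \end{aligned}$$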




The probability of X falling in a given interval and Y falling in a given 
interval is the product of the probability of X falling in the one interval and the 
probability of Y falling in the other interval. 

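If the cdf of $X_1, \ldots, X_p$ is $F(x_1, \ldots, x_p)$, the set of random variables is said 
to be mutually independent if

(24)    $$F(x_1, \ldots, x_p) = F_1(x_1) \cdots F_p(x_p),$$

where $F_i(x_i)$ is the marginal cdf of $X_i$, $i = 1, \ldots, p$. The set $X_1, \ldots, X_r$ is said 
to be independent of the set $X_{r+1}, \ldots, X_p$ if

(25)    $$F(x_1, \ldots, x_p) = F(x_1, \ldots, x_r, \infty, \ldots, \infty) \cdot F(\infty, \ldots, \infty, x_{r+1}, \ldots, x_p).$$

One result of independence is that joint moments factor. For example, if 
$X_1, \ldots, X_p$ are mutually independent, then

(26)    $$\begin{aligned} \mathscr{E} X_1^{h_1} \cdots X_p^{h_p} &= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_1^{h_1} \cdots x_p^{h_p} f_1(x_1) \cdots f_p(x_p)\, dx_1 \cdots dx_p \\ &= \prod_{i=1}^{p} \left\{ \int_{-\infty}^{\infty} x_i^{h_i} f_i(x_i)\, dx_i \right\} \\ &= \prod_{i=1}^{p} \left\{ \mathscr{E} X_i^{h_i} \right\}. \end{aligned}$$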


2.2.4. Conditional Distributions 


If A and B are two events such that the probability of A and B occurring 
simultaneously is P(AB) and the probability of B occurring is P(B) > 0, 
then the conditional probability of A occurring given that B has occurred is 
P(AB)/P(B). Suppose the event A is X falling in the interval $[x_1, x_2]$ and 
the event B is Y falling in $[y_1, y_2]$. Then the conditional probability that 
X falls in $[x_1, x_2]$, given that Y falls in $[y_1, y_2]$, is

(27)    $$\begin{aligned} \Pr\{x_1 \le X \le x_2 \mid y_1 \le Y \le y_2\} &= \frac{\Pr\{x_1 \le X \le x_2,\ y_1 \le Y \le y_2\}}{\Pr\{y_1 \le Y \le y_2\}} \\ &= \frac{\int_{x_1}^{x_2} \int_{y_1}^{y_2} f(u, v)\, dv\, du}{\int_{y_1}^{y_2} g(v)\, dv}. \end{aligned}$$

Now let $y_1 = y$, $y_2 = y + \Delta y$. Then for a continuous density,

(28)    $$\int_{y}^{y + \Delta y} g(v)\, dv = g(y^*)\, \Delta y,$$

where $y \le y^* \le y + \Delta y$. Also

(29)    $$\int_{y}^{y + \Delta y} f(u, v)\, dv = f[u, y^*(u)]\, \Delta y,$$

where $y \le y^*(u) \le y + \Delta y$. Therefore,

(30)    $$\Pr\{x_1 \le X \le x_2 \mid y \le Y \le y + \Delta y\} = \int_{x_1}^{x_2} \frac{f[u, y^*(u)]}{g(y^*)}\, du.$$


It will be noticed that for fixed y and $\Delta y$ (> 0), the integrand of (30) behaves 
as a univariate density function. Now for y such that g(y) > 0, we define 
$\Pr\{x_1 \le X \le x_2 \mid Y = y\}$, the probability that X lies between $x_1$ and $x_2$, given 
that Y is y, as the limit of (30) as $\Delta y \to 0$. Thus

(31)    $$\Pr\{x_1 \le X \le x_2 \mid Y = y\} = \int_{x_1}^{x_2} f(u \mid y)\, du,$$

where $f(u \mid y) = f(u, y)/g(y)$. For given y, $f(u \mid y)$ is a density function and is 
called the conditional density of X given y. We note that if X and Y are 
independent, $f(x \mid y) = f(x)$. 
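
To make the definition of the conditional density concrete, here is a small sketch using an assumed toy density f(x, y) = x + y on the unit square (not an example from the text). The marginal g(y) is obtained by numerical integration, and the conditional probability in (31) is computed from f(u | y) = f(u, y)/g(y). All function names are hypothetical.

```python
def f(x, y):
    """Assumed toy joint density: x + y on the unit square, 0 elsewhere."""
    return x + y if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 else 0.0


def midpoint_integral(func, a, b, n=1000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return sum(func(a + (i + 0.5) * h) for i in range(n)) * h


def marginal_g(y):
    """Marginal density of Y: integrate f(u, y) over u."""
    return midpoint_integral(lambda u: f(u, y), 0.0, 1.0)


def conditional_prob(x1, x2, y):
    """Pr{x1 <= X <= x2 | Y = y} as in (31), defined only where g(y) > 0."""
    g_y = marginal_g(y)
    if g_y <= 0.0:
        raise ValueError("conditional density undefined where g(y) = 0")
    return midpoint_integral(lambda u: f(u, y) / g_y, x1, x2)


prob = conditional_prob(0.0, 0.5, 0.5)   # analytically 0.375 for this toy density
total = conditional_prob(0.0, 1.0, 0.5)  # a conditional density integrates to 1
```

For this density g(0.5) = 1, so f(u | 0.5) = u + 0.5 and the first probability can be checked by hand.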

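In the general case of $X_1, \ldots, X_p$ with cdf $F(x_1, \ldots, x_p)$, the conditional 
density of $X_1, \ldots, X_r$, given $X_{r+1} = x_{r+1}, \ldots, X_p = x_p$, is

(32)    $$\frac{f(x_1, \ldots, x_p)}{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(u_1, \ldots, u_r, x_{r+1}, \ldots, x_p)\, du_1 \cdots du_r}.$$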


For a more general discussion of conditional probabilities, the reader is 
referred to Chung (1974), Kolmogorov (1950), Loéve (1977), (1978), and 
Neveu (1965). 


2.2.5. Transformation of Variables 


Let the density of $X_1, \ldots, X_p$ be $f(x_1, \ldots, x_p)$. Consider the p real-valued 
functions

(33)    $$y_i = y_i(x_1, \ldots, x_p), \qquad i = 1, \ldots, p.$$

We assume that the transformation from the x-space to the y-space is 
one-to-one;† the inverse transformation is

(34)    $$x_i = x_i(y_1, \ldots, y_p), \qquad i = 1, \ldots, p.$$

†More precisely, we assume this is true for the part of the x-space for which $f(x_1, \ldots, x_p)$ is 
positive. 




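Let the random variables $Y_1, \ldots, Y_p$ be defined by

(35)    $$Y_i = y_i(X_1, \ldots, X_p), \qquad i = 1, \ldots, p.$$

Then the density of $Y_1, \ldots, Y_p$ is

(36)    $$g(y_1, \ldots, y_p) = f[x_1(y_1, \ldots, y_p), \ldots, x_p(y_1, \ldots, y_p)]\, J(y_1, \ldots, y_p),$$

where $J(y_1, \ldots, y_p)$ is the Jacobian

(37)    $$J(y_1, \ldots, y_p) = \operatorname{mod} \begin{vmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} & \cdots & \dfrac{\partial x_1}{\partial y_p} \\ \dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2} & \cdots & \dfrac{\partial x_2}{\partial y_p} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial x_p}{\partial y_1} & \dfrac{\partial x_p}{\partial y_2} & \cdots & \dfrac{\partial x_p}{\partial y_p} \end{vmatrix}.$$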

We assume the derivatives exist, and “mod” means modulus or absolute value 
of the expression following it. The probability that $(X_1, \ldots, X_p)$ falls in a 
region R is given by (11); the probability that $(Y_1, \ldots, Y_p)$ falls in a region S is

(38)    $$\Pr\{(Y_1, \ldots, Y_p) \in S\} = \int \cdots \int_{S} g(y_1, \ldots, y_p)\, dy_1 \cdots dy_p.$$

If S is the transform of R, that is, if each point of R transforms by (33) into a 
point of S and if each point of S transforms into R by (34), then (11) is equal 
to (38) by the usual theory of transformation of multiple integrals. From this 
follows the assertion that (36) is the density of $Y_1, \ldots, Y_p$. 
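
The change-of-variables rule (36) can be tried on a small example. The sketch below assumes, purely for illustration, that $X_1$ and $X_2$ are independent with density $e^{-(x_1 + x_2)}$ on the positive quadrant, and uses the transformation $y_1 = x_1 + x_2$, $y_2 = x_1/(x_1 + x_2)$; neither the density nor the transformation comes from the text. The Jacobian (37) is approximated by central differences of the inverse transformation.

```python
import math


def f(x1, x2):
    """Assumed density of (X1, X2): independent unit exponentials."""
    return math.exp(-(x1 + x2)) if x1 > 0 and x2 > 0 else 0.0


def inverse(y1, y2):
    """Inverse transformation (34): x1 = y1*y2, x2 = y1*(1 - y2)."""
    return (y1 * y2, y1 * (1.0 - y2))


def jacobian_mod(y1, y2, h=1e-6):
    """Absolute value of det[dx_i/dy_j], by central differences."""
    d_y1 = [(p - m) / (2 * h)
            for p, m in zip(inverse(y1 + h, y2), inverse(y1 - h, y2))]
    d_y2 = [(p - m) / (2 * h)
            for p, m in zip(inverse(y1, y2 + h), inverse(y1, y2 - h))]
    # Rows are x1, x2; columns are y1, y2.
    det = d_y1[0] * d_y2[1] - d_y2[0] * d_y1[1]
    return abs(det)


def g(y1, y2):
    """Density of (Y1, Y2) by (36): f at the inverse point times the Jacobian."""
    x1, x2 = inverse(y1, y2)
    return f(x1, x2) * jacobian_mod(y1, y2)


value = g(1.5, 0.3)  # analytically y1 * exp(-y1) for y1 > 0, 0 < y2 < 1
```

Here the Jacobian works out to $y_1$, so the transformed density $y_1 e^{-y_1}$ does not depend on $y_2$.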


2.3. THE MULTIVARIATE NORMAL DISTRIBUTION 


The univariate normal density function can be written

(1)    $$k e^{-\frac{1}{2} \alpha (x - \beta)^2} = k e^{-\frac{1}{2} (x - \beta) \alpha (x - \beta)},$$

where $\alpha$ is positive and k is chosen so that the integral of (1) over the entire 
x-axis is unity. The density function of a multivariate normal distribution of 
$X_1, \ldots, X_p$ has an analogous form. The scalar variable x is replaced by a 
vector

(2)    $$x = \begin{pmatrix} x_1 \\ \vdots \\ x_p \end{pmatrix},$$




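the scalar constant $\beta$ is replaced by a vector

(3)    $$b = \begin{pmatrix} b_1 \\ \vdots \\ b_p \end{pmatrix},$$

and the positive constant $\alpha$ is replaced by a positive definite (symmetric) 
matrix

(4)    $$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & & \vdots \\ a_{p1} & a_{p2} & \cdots & a_{pp} \end{pmatrix}.$$

The square $\alpha (x - \beta)^2 = (x - \beta) \alpha (x - \beta)$ is replaced by the quadratic form

(5)    $$(x - b)' A (x - b) = \sum_{i,j=1}^{p} a_{ij} (x_i - b_i)(x_j - b_j).$$

Thus the density function of a p-variate normal distribution is

(6)    $$f(x_1, \ldots, x_p) = K e^{-\frac{1}{2} (x - b)' A (x - b)},$$

where K (> 0) is chosen so that the integral over the entire p-dimensional 
Euclidean space of $x_1, \ldots, x_p$ is unity. 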

Written in matrix notation, the similarity of the multivariate normal 
density (6) to the univariate density (1) is clear. Throughout this book we 
shall use matrix notation and operations. The reader is referred to the 
Appendix for a review of matrix theory and for definitions of our notation for 
matrix operations. 

We observe that $f(x_1, \ldots, x_p)$ is nonnegative. Since A is positive definite,

(7)    $$(x - b)' A (x - b) \ge 0,$$

and therefore the density is bounded; that is,

(8)    $$f(x_1, \ldots, x_p) \le K.$$

Now let us determine K so that the integral of (6) over the p-dimensional 
space is one. We shall evaluate

(9)    $$K^* = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{-\frac{1}{2} (x - b)' A (x - b)}\, dx_p \cdots dx_1.$$




We use the fact (see Corollary A.1.6 in the Appendix) that if A is positive 
definite, there exists a nonsingular matrix C such that 


(10) C'AC =I, 


where I denotes the identity and C’ the transpose of C. Let 


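(11)    $$x - b = C y,$$

where

(12)    $$y = \begin{pmatrix} y_1 \\ \vdots \\ y_p \end{pmatrix}.$$

Then

(13)    $$(x - b)' A (x - b) = y' C' A C y = y' y.$$

The Jacobian of the transformation is

(14)    $$J = \operatorname{mod} |C|,$$

where $\operatorname{mod} |C|$ indicates the absolute value of the determinant of C. Thus (9) 
becomes

(15)    $$K^* = \operatorname{mod} |C| \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{-\frac{1}{2} y' y}\, dy_p \cdots dy_1.$$

We have

(16)    $$e^{-\frac{1}{2} y' y} = \exp\left( -\tfrac{1}{2} \sum_{i=1}^{p} y_i^2 \right) = \prod_{i=1}^{p} e^{-\frac{1}{2} y_i^2},$$

where $\exp(z) = e^z$. We can write (15) as

(17)    $$\begin{aligned} K^* &= \operatorname{mod} |C| \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{-\frac{1}{2} y_1^2} \cdots e^{-\frac{1}{2} y_p^2}\, dy_p \cdots dy_1 \\ &= \operatorname{mod} |C| \prod_{i=1}^{p} \left\{ \int_{-\infty}^{\infty} e^{-\frac{1}{2} y_i^2}\, dy_i \right\} \\ &= \operatorname{mod} |C| \prod_{i=1}^{p} \left\{ \sqrt{2\pi} \right\} \\ &= \operatorname{mod} |C|\, (2\pi)^{\frac{1}{2} p} \end{aligned}$$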



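by virtue of

(18)    $$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2} t^2}\, dt = 1.$$

Corresponding to (10) is the determinantal equation

(19)    $$|C'| \cdot |A| \cdot |C| = |I|.$$

Since

(20)    $$|C'| = |C|,$$

and since $|I| = 1$, we deduce from (19) that

(21)    $$\operatorname{mod} |C| = 1 / \sqrt{|A|}.$$

Thus

(22)    $$K = 1 / K^* = \sqrt{|A|}\, (2\pi)^{-\frac{1}{2} p}.$$

The normal density function is

(23)    $$\frac{\sqrt{|A|}}{(2\pi)^{\frac{1}{2} p}}\, e^{-\frac{1}{2} (x - b)' A (x - b)}.$$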
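
The constant in (22) can be checked numerically for p = 2. The sketch below picks an arbitrary positive definite matrix A and vector b (both are illustrative assumptions, not values from the text), evaluates $K^*$ in (9) by a midpoint rule on a wide grid, and compares it with the closed form.

```python
import math

# Assumed illustrative parameters (p = 2): A positive definite, b arbitrary.
A = [[2.0, 0.5], [0.5, 1.0]]
b = [1.0, -1.0]


def quad_form(x1, x2):
    """(x - b)' A (x - b) for the 2 x 2 symmetric matrix A."""
    d1, d2 = x1 - b[0], x2 - b[1]
    return A[0][0] * d1 * d1 + 2.0 * A[0][1] * d1 * d2 + A[1][1] * d2 * d2


det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]
K = math.sqrt(det_A) * (2.0 * math.pi) ** (-0.5 * 2)  # equation (22) with p = 2


def k_star_numeric(half_width=10.0, step=0.05):
    """Midpoint-rule approximation of the integral (9) over a square around b."""
    n = int(round(2.0 * half_width / step))
    total = 0.0
    for i in range(n):
        x1 = b[0] - half_width + (i + 0.5) * step
        for j in range(n):
            x2 = b[1] - half_width + (j + 0.5) * step
            total += math.exp(-0.5 * quad_form(x1, x2))
    return total * step * step


K_star = k_star_numeric()
product = K * K_star  # should be very close to 1 if (22) holds
```

The grid is truncated at ten units from b in each direction, which is far enough into the tails for this A that the neglected mass is negligible.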


We shall now show the significance of b and A by finding the first and 
second moments of $X_1, \ldots, X_p$. It will be convenient to consider these 
random variables as constituting a random vector

(24)    $$X = \begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix}.$$

We shall define generally a random matrix and the expected value of a 
random matrix; a random vector is considered as a special case of a random 
matrix with one column. 

Definition 2.3.1. A random matrix Z is a matrix

(25)    $$Z = (Z_{gh}), \qquad g = 1, \ldots, m, \quad h = 1, \ldots, n,$$

of random variables $Z_{11}, \ldots, Z_{mn}$. 



If the random variables $Z_{11}, \ldots, Z_{mn}$ can take on only a finite number of 
values, the random matrix Z can be one of a finite number of matrices, say 
$Z(1), \ldots, Z(q)$. If the probability of $Z = Z(i)$ is $p_i$, then we should like to 
define $\mathscr{E} Z$ as $\sum_{i=1}^{q} Z(i) p_i$. Then $\mathscr{E} Z = (\mathscr{E} Z_{gh})$. If the random variables 
$Z_{11}, \ldots, Z_{mn}$ have a joint density, then by operating with Riemann sums we 
can define $\mathscr{E} Z$ as the limit (if the limit exists) of approximating sums of the 
kind occurring in the discrete case; then again $\mathscr{E} Z = (\mathscr{E} Z_{gh})$. Therefore, in 
general we shall use the following definition: 

Definition 2.3.2. The expected value of a random matrix Z is

(26)    $$\mathscr{E} Z = (\mathscr{E} Z_{gh}), \qquad g = 1, \ldots, m, \quad h = 1, \ldots, n.$$

In particular if Z is X defined by (24), the expected value

(27)    $$\mathscr{E} X = \begin{pmatrix} \mathscr{E} X_1 \\ \vdots \\ \mathscr{E} X_p \end{pmatrix}$$

is the mean or mean vector of X. We shall usually denote this mean vector by 
$\mu$. If Z is $(X - \mu)(X - \mu)'$, the expected value is

(28)    $$\mathscr{C}(X) = \mathscr{E} (X - \mu)(X - \mu)' = \left[ \mathscr{E} (X_i - \mu_i)(X_j - \mu_j) \right],$$

the covariance or covariance matrix of X. The ith diagonal element of this 
matrix, $\mathscr{E} (X_i - \mu_i)^2$, is the variance of $X_i$, and the i, jth off-diagonal element, 
$\mathscr{E} (X_i - \mu_i)(X_j - \mu_j)$, is the covariance of $X_i$ and $X_j$, $i \ne j$. We shall 
usually denote the covariance matrix by $\Sigma$. Note that

(29)    $$\mathscr{C}(X) = \mathscr{E} (X X' - \mu X' - X \mu' + \mu \mu') = \mathscr{E} X X' - \mu \mu'.$$


The operation of taking the expected value of a random matrix (or vector) 
satisfies certain rules which we can summarize in the following lemma: 


Lemma 2.3.1. If Z is an $m \times n$ random matrix, D is an $l \times m$ real matrix, 
E is an $n \times q$ real matrix, and F is an $l \times q$ real matrix, then

(30)    $$\mathscr{E} (D Z E + F) = D (\mathscr{E} Z) E + F.$$




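Proof. The element in the ith row and jth column of $\mathscr{E} (D Z E + F)$ is

(31)    $$\mathscr{E} \left( \sum_{h,g} d_{ih} Z_{hg} e_{gj} + f_{ij} \right) = \sum_{h,g} d_{ih} (\mathscr{E} Z_{hg}) e_{gj} + f_{ij},$$

which is the element in the ith row and jth column of $D (\mathscr{E} Z) E + F$. ∎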


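Lemma 2.3.2. If $Y = D X + f$, where X is a random vector, then

(32)    $$\mathscr{E} Y = D \mathscr{E} X + f,$$

(33)    $$\mathscr{C}(Y) = D \mathscr{C}(X) D'.$$

Proof. The first assertion follows directly from Lemma 2.3.1, and the 
second from

(34)    $$\begin{aligned} \mathscr{C}(Y) &= \mathscr{E} (Y - \mathscr{E} Y)(Y - \mathscr{E} Y)' \\ &= \mathscr{E} [D X + f - (D \mathscr{E} X + f)][D X + f - (D \mathscr{E} X + f)]' \\ &= \mathscr{E} [D (X - \mathscr{E} X)][D (X - \mathscr{E} X)]' \\ &= \mathscr{E} [D (X - \mathscr{E} X)(X - \mathscr{E} X)' D'], \end{aligned}$$

which yields the right-hand side of (33) by Lemma 2.3.1. ∎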
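
Lemma 2.3.2 can be checked exactly on a random vector that takes finitely many values, the discrete case mentioned before Definition 2.3.2. In the sketch below the support points, probabilities, the matrix D, and the vector f are all made-up illustrative values, and the helper names are hypothetical.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def mean(values, probs):
    """Expected value of a discrete random vector."""
    p = len(values[0])
    return [sum(pr * v[i] for v, pr in zip(values, probs)) for i in range(p)]


def cov(values, probs):
    """Covariance matrix E(X - mu)(X - mu)' of a discrete random vector."""
    m = mean(values, probs)
    p = len(m)
    return [[sum(pr * (v[i] - m[i]) * (v[j] - m[j]) for v, pr in zip(values, probs))
             for j in range(p)] for i in range(p)]


def affine(D, f, x):
    """Y = D x + f for one realization x."""
    return [sum(d * xi for d, xi in zip(row, x)) + fi for row, fi in zip(D, f)]


# Assumed illustrative distribution of X (4 support points in 3 dimensions).
xs = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 2.0, 0.0], [1.0, 3.0, 1.0]]
ps = [0.1, 0.2, 0.3, 0.4]
D = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0]]
f = [5.0, -2.0]

ys = [affine(D, f, x) for x in xs]
lhs_mean, rhs_mean = mean(ys, ps), affine(D, f, mean(xs, ps))          # (32)
lhs_cov = cov(ys, ps)
rhs_cov = matmul(matmul(D, cov(xs, ps)), transpose(D))                  # (33)
```

Both sides of (32) and (33) agree up to floating-point rounding, and the shift f drops out of the covariance as the proof indicates.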
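When the transformation corresponds to (11), that is, $X = C Y + b$, then 
$\mathscr{E} X = C \mathscr{E} Y + b$. By the transformation theory given in Section 2.2, the density 
of Y is proportional to (16); that is, it is

(35)    $$\frac{1}{(2\pi)^{\frac{1}{2} p}}\, e^{-\frac{1}{2} y' y} = \prod_{h=1}^{p} \left\{ \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} y_h^2} \right\}.$$

The expected value of the ith component of Y is

(36)    $$\begin{aligned} \mathscr{E} Y_i &= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} y_i \prod_{h=1}^{p} \left\{ \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} y_h^2} \right\} dy_1 \cdots dy_p \\ &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} y_i e^{-\frac{1}{2} y_i^2}\, dy_i \prod_{\substack{h=1 \\ h \ne i}}^{p} \left\{ \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2} y_h^2}\, dy_h \right\} \\ &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} y_i e^{-\frac{1}{2} y_i^2}\, dy_i \\ &= 0. \end{aligned}$$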




The last equality follows because† $y_i e^{-\frac{1}{2} y_i^2}$ is an odd function of $y_i$. Thus 
$\mathscr{E} Y = 0$. Therefore, the mean of X, denoted by $\mu$, is

(37)    $$\mu = \mathscr{E} X = b.$$


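From (33) we see that $\mathscr{C}(X) = C (\mathscr{E} Y Y') C'$. The i, jth element of $\mathscr{E} Y Y'$ is

(38)    $$\mathscr{E} Y_i Y_j = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} y_i y_j \prod_{h=1}^{p} \left\{ \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} y_h^2} \right\} dy_1 \cdots dy_p$$

because the density of Y is (35). If i = j, we have

(39)    $$\begin{aligned} \mathscr{E} Y_i^2 &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} y_i^2 e^{-\frac{1}{2} y_i^2}\, dy_i \prod_{\substack{h=1 \\ h \ne i}}^{p} \left\{ \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2} y_h^2}\, dy_h \right\} \\ &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} y_i^2 e^{-\frac{1}{2} y_i^2}\, dy_i \\ &= 1. \end{aligned}$$

The last equality follows because the next to last expression is the expected 
value of the square of a variable normally distributed with mean 0 and 
variance 1. If $i \ne j$, (38) becomes

(40)    $$\begin{aligned} \mathscr{E} Y_i Y_j &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} y_i e^{-\frac{1}{2} y_i^2}\, dy_i\; \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} y_j e^{-\frac{1}{2} y_j^2}\, dy_j \prod_{\substack{h=1 \\ h \ne i, j}}^{p} \left\{ \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2} y_h^2}\, dy_h \right\} \\ &= 0, \qquad i \ne j, \end{aligned}$$

since the first integration gives 0. We can summarize (39) and (40) as

(41)    $$\mathscr{E} Y Y' = I.$$

Thus

(42)    $$\mathscr{E} (X - \mu)(X - \mu)' = C I C' = C C'.$$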


From (10) we obtain $A = (C')^{-1} C^{-1}$ by multiplication by $(C')^{-1}$ on the left 
and by $C^{-1}$ on the right. Taking inverses on both sides of the equality 


†Alternatively, the last equality follows because the next to last expression is the expected value 
of a normally distributed variable with mean 0. 




gives us

(43)    $$C C' = A^{-1}.$$

Thus, the covariance matrix of X is

(44)    $$\Sigma = \mathscr{E} (X - \mu)(X - \mu)' = A^{-1}.$$

From (43) we see that $\Sigma$ is positive definite. Let us summarize these results. 


Theorem 2.3.1. If the density of a p-dimensional random vector X is (23), 
then the expected value of X is b and the covariance matrix is $A^{-1}$. Conversely, 
given a vector $\mu$ and a positive definite matrix $\Sigma$, there is a multivariate normal 
density

(45)    $$(2\pi)^{-\frac{1}{2} p}\, |\Sigma|^{-\frac{1}{2}}\, e^{-\frac{1}{2} (x - \mu)' \Sigma^{-1} (x - \mu)}$$

such that the expected value of the vector with this density is $\mu$ and the covariance 
matrix is $\Sigma$. 

We shall denote the density (45) as $n(x \mid \mu, \Sigma)$ and the distribution law as 
$N(\mu, \Sigma)$. 

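The ith diagonal element of the covariance matrix, $\sigma_{ii}$, is the variance of 
the ith component of X; we may sometimes denote this by $\sigma_i^2$. The 
correlation coefficient between $X_i$ and $X_j$ is defined as

(46)    $$\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}} \sqrt{\sigma_{jj}}} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}.$$

This measure of association is symmetric in $X_i$ and $X_j$: $\rho_{ij} = \rho_{ji}$. Since

(47)    $$\begin{pmatrix} \sigma_{ii} & \sigma_{ij} \\ \sigma_{ji} & \sigma_{jj} \end{pmatrix} = \begin{pmatrix} \sigma_i^2 & \sigma_i \sigma_j \rho_{ij} \\ \sigma_i \sigma_j \rho_{ij} & \sigma_j^2 \end{pmatrix}$$

is positive definite (Corollary A.1.3 of the Appendix), the determinant

(48)    $$\begin{vmatrix} \sigma_i^2 & \sigma_i \sigma_j \rho_{ij} \\ \sigma_i \sigma_j \rho_{ij} & \sigma_j^2 \end{vmatrix} = \sigma_i^2 \sigma_j^2 (1 - \rho_{ij}^2)$$

is positive. Therefore, $-1 < \rho_{ij} < 1$. (For singular distributions, see Section 
2.4.) The multivariate normal density can be parametrized by the means $\mu_i$, 
$i = 1, \ldots, p$, the variances $\sigma_i^2$, $i = 1, \ldots, p$, and the correlations $\rho_{ij}$, $i < j$, 
$i, j = 1, \ldots, p$. 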




As a special case of the preceding theory, we consider the bivariate normal 
distribution. The mean vector is 


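(49)    $$\mathscr{E} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix};$$

the covariance matrix may be written

(50)    $$\begin{aligned} \Sigma &= \mathscr{E} \begin{pmatrix} (X_1 - \mu_1)^2 & (X_1 - \mu_1)(X_2 - \mu_2) \\ (X_2 - \mu_2)(X_1 - \mu_1) & (X_2 - \mu_2)^2 \end{pmatrix} \\ &= \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \sigma_1 \sigma_2 \rho \\ \sigma_1 \sigma_2 \rho & \sigma_2^2 \end{pmatrix}, \end{aligned}$$

where $\sigma_1^2$ is the variance of $X_1$, $\sigma_2^2$ the variance of $X_2$, and $\rho$ the 
correlation between $X_1$ and $X_2$. The inverse of (50) is

(51)    $$\Sigma^{-1} = \frac{1}{1 - \rho^2} \begin{pmatrix} \dfrac{1}{\sigma_1^2} & -\dfrac{\rho}{\sigma_1 \sigma_2} \\ -\dfrac{\rho}{\sigma_1 \sigma_2} & \dfrac{1}{\sigma_2^2} \end{pmatrix}.$$

The density function of $X_1$ and $X_2$ is

(52)    $$\frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \frac{(x_1 - \mu_1)^2}{\sigma_1^2} - 2\rho \frac{(x_1 - \mu_1)(x_2 - \mu_2)}{\sigma_1 \sigma_2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2} \right] \right\}.$$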
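
As a consistency check between the general density (45) and the bivariate form (52), the sketch below evaluates both at one point. The parameter values are arbitrary illustrative choices and the function names are hypothetical; the general form is specialized to p = 2 so that the inverse and determinant can be written out by hand.

```python
import math


def bivariate_density(x1, x2, mu1, mu2, s1, s2, rho):
    """Equation (52), parametrized by means, standard deviations, and rho."""
    z1 = (x1 - mu1) / s1
    z2 = (x2 - mu2) / s2
    q = (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (1.0 - rho * rho)
    return math.exp(-0.5 * q) / (2.0 * math.pi * s1 * s2 * math.sqrt(1.0 - rho * rho))


def general_density_2d(x, mu, Sigma):
    """Equation (45) with p = 2, using the explicit 2 x 2 inverse of Sigma."""
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    q = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))


# Assumed illustrative parameters.
mu1, mu2, s1, s2, rho = 1.0, 0.5, 2.0, 1.5, 0.4
Sigma = [[s1 * s1, s1 * s2 * rho], [s1 * s2 * rho, s2 * s2]]  # as in (50)

d52 = bivariate_density(0.3, -0.2, mu1, mu2, s1, s2, rho)
d45 = general_density_2d([0.3, -0.2], [mu1, mu2], Sigma)
```

The two values coincide up to rounding because $|\Sigma| = \sigma_1^2 \sigma_2^2 (1 - \rho^2)$ and the quadratic form built from (51) is exactly the bracket in (52).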

Theorem 2.3.2. The correlation coefficient $\rho$ of any bivariate distribution is 
invariant with respect to transformations $X_i^* = b_i X_i + c_i$, $b_i > 0$, $i = 1, 2$. Every 
function of the parameters of a bivariate normal distribution that is invariant with 
respect to such transformations is a function of $\rho$. 

Proof. The variance of $X_i^*$ is $b_i^2 \sigma_i^2$, $i = 1, 2$, and the covariance of $X_1^*$ and 
$X_2^*$ is $b_1 b_2 \sigma_1 \sigma_2 \rho$ by Lemma 2.3.2. Insertion of these values into the 
definition of the correlation between $X_1^*$ and $X_2^*$ shows that it is $\rho$. If 
$f(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$ is invariant with respect to such transformations, it must 
be $f(0, 0, 1, 1, \rho)$ by choice of $b_i = 1/\sigma_i$ and $c_i = -\mu_i/\sigma_i$, $i = 1, 2$. ∎




The correlation coefficient $\rho$ is the natural measure of association between 
$X_1$ and $X_2$. Any function of the parameters of the bivariate normal distribution 
that is independent of the scale and location parameters is a function of 
$\rho$. The standardized variable (or standard score) is $Y_i = (X_i - \mu_i)/\sigma_i$. The 
mean squared difference between the two standardized variables is

(53)    $$\mathscr{E} (Y_1 - Y_2)^2 = 2(1 - \rho).$$

The smaller (53) is (that is, the larger $\rho$ is), the more similar $Y_1$ and $Y_2$ are. If 
$\rho > 0$, $X_1$ and $X_2$ tend to be positively related, and if $\rho < 0$, they tend to be 
negatively related. If $\rho = 0$, the density (52) is the product of the marginal 
densities of $X_1$ and $X_2$; hence $X_1$ and $X_2$ are independent. 

It will be noticed that the density function (45) is constant on ellipsoids

(54)    $$(x - \mu)' \Sigma^{-1} (x - \mu) = c$$

for every positive value of c in a p-dimensional Euclidean space. The center 
of each ellipsoid is at the point $\mu$. The shape and orientation of the ellipsoid 
are determined by $\Sigma$, and the size (given $\Sigma$) is determined by c. Because (54) 
is a sphere if $\Sigma = \sigma^2 I$, $n(x \mid \mu, \sigma^2 I)$ is known as a spherical normal density. 

Let us consider in detail the bivariate case of the density (52). We 
transform coordinates by $(x_i - \mu_i)/\sigma_i = y_i$, $i = 1, 2$, so that the centers of the 
loci of constant density are at the origin. These loci are defined by

(55)    $$\frac{1}{1 - \rho^2} (y_1^2 - 2\rho y_1 y_2 + y_2^2) = c.$$

The intercepts on the $y_1$-axis and $y_2$-axis are equal. If $\rho > 0$, the major axis of 
the ellipse is along the 45° line with a length of $2\sqrt{c(1 + \rho)}$, and the minor 
axis has a length of $2\sqrt{c(1 - \rho)}$. If $\rho < 0$, the major axis is along the 135° line 
with a length of $2\sqrt{c(1 - \rho)}$, and the minor axis has a length of $2\sqrt{c(1 + \rho)}$. 
The value of $\rho$ determines the ratio of these lengths. In this bivariate case we 
can think of the density function as a surface above the plane. The contours 
of equal density are contours of equal altitude on a topographical map; they 
indicate the shape of the hill (or probability surface). If $\rho > 0$, the hill will 
tend to run along a line with a positive slope; most of the hill will be in the 
first and third quadrants. When we transform back to $x_i = \sigma_i y_i + \mu_i$, we 
expand each contour by a factor of $\sigma_i$ in the direction of the ith axis and 
shift the center to $(\mu_1, \mu_2)$. 
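
The axis lengths quoted for the ellipse (55) can be confirmed by a direct scan over directions. Writing $y_1 = t\cos\theta$, $y_2 = t\sin\theta$ in (55) gives $t^2 (1 - \rho \sin 2\theta)/(1 - \rho^2) = c$, and the sketch below searches a grid of angles for the longest and shortest diameters. The values of $\rho$ and c and the function names are illustrative assumptions.

```python
import math


def half_length(theta, rho, c):
    """Distance from the origin to the ellipse (55) along direction theta."""
    return math.sqrt(c * (1.0 - rho ** 2) / (1.0 - rho * math.sin(2.0 * theta)))


def axes(rho, c, n=3600):
    """Scan directions in [0, 180) degrees; return (length, angle) of both axes."""
    thetas = [math.pi * k / n for k in range(n)]
    radii = [half_length(t, rho, c) for t in thetas]
    k_max = max(range(n), key=radii.__getitem__)
    k_min = min(range(n), key=radii.__getitem__)
    major = (2.0 * radii[k_max], math.degrees(thetas[k_max]))
    minor = (2.0 * radii[k_min], math.degrees(thetas[k_min]))
    return major, minor


# Assumed illustrative values: rho = 0.6, c = 2.
(major_len, major_deg), (minor_len, minor_deg) = axes(0.6, 2.0)
```

For a positive $\rho$ the scan places the major axis at 45° and the minor axis at 135°, with lengths matching $2\sqrt{c(1 + \rho)}$ and $2\sqrt{c(1 - \rho)}$; a negative $\rho$ swaps the two directions.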




The numerical values of the cdf of the univariate normal variable are 
obtained from tables found in most statistical texts. The numerical values of 


(56)    $$\begin{aligned} F(x_1, x_2) &= \Pr\{X_1 \le x_1,\ X_2 \le x_2\} \\ &= \Pr\left\{ \frac{X_1 - \mu_1}{\sigma_1} \le y_1,\ \frac{X_2 - \mu_2}{\sigma_2} \le y_2 \right\}, \end{aligned}$$

where $y_1 = (x_1 - \mu_1)/\sigma_1$ and $y_2 = (x_2 - \mu_2)/\sigma_2$, can be found in Pearson 
(1931). An extensive table has been given by the National Bureau of Standards 
(1959). A bibliography of such tables has been given by Gupta (1963). 
Pearson has also shown that

(57)    $$F(x_1, x_2) = \sum_{j=0}^{\infty} \rho^j \tau_j(y_1) \tau_j(y_2),$$

where the so-called tetrachoric functions $\tau_j(y)$ are tabulated in Pearson (1930) 
up to $\tau_{19}(y)$. Harris and Soms (1980) have studied generalizations of (57). 


2.4. THE DISTRIBUTION OF LINEAR COMBINATIONS OF 
NORMALLY DISTRIBUTED VARIATES; INDEPENDENCE 
OF VARIATES; MARGINAL DISTRIBUTIONS 


One of the reasons that the study of normal multivariate distributions is so 
useful is that marginal distributions and conditional distributions derived 
from multivariate normal distributions are also normal distributions. More- 
over, linear combinations of multivariate normal variates are again normally 
distributed. First we shall show that if we make a nonsingular linear transfor- 
mation of a vector whose components have a joint distribution with a normal 
density, we obtain a vector whose components are jointly distributed with a 
normal density. 


Theorem 2.4.1. Let X (with p components) be distributed according to 
$N(\mu, \Sigma)$. Then

(1)    $$Y = C X$$

is distributed according to $N(C\mu, C \Sigma C')$ for C nonsingular. 


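Proof. The density of Y is obtained from the density of X, $n(x \mid \mu, \Sigma)$, by 
replacing x by

(2)    $$x = C^{-1} y,$$

and multiplying by the Jacobian of the transformation (2),

(3)    $$\operatorname{mod} |C^{-1}| = \frac{1}{\operatorname{mod} |C|} = \sqrt{\frac{|\Sigma|}{|C| \cdot |\Sigma| \cdot |C'|}} = \frac{|\Sigma|^{\frac{1}{2}}}{|C \Sigma C'|^{\frac{1}{2}}}.$$

The quadratic form in the exponent of $n(x \mid \mu, \Sigma)$ is

(4)    $$Q = (x - \mu)' \Sigma^{-1} (x - \mu).$$

The transformation (2) carries Q into

(5)    $$\begin{aligned} Q &= (C^{-1} y - \mu)' \Sigma^{-1} (C^{-1} y - \mu) \\ &= (C^{-1} y - C^{-1} C \mu)' \Sigma^{-1} (C^{-1} y - C^{-1} C \mu) \\ &= [C^{-1} (y - C\mu)]' \Sigma^{-1} [C^{-1} (y - C\mu)] \\ &= (y - C\mu)' (C^{-1})' \Sigma^{-1} C^{-1} (y - C\mu) \\ &= (y - C\mu)' (C \Sigma C')^{-1} (y - C\mu) \end{aligned}$$

since $(C^{-1})' = (C')^{-1}$ by virtue of transposition of $C C^{-1} = I$. Thus the 
density of Y is

(6)    $$\begin{aligned} n(C^{-1} y \mid \mu, \Sigma)\, \operatorname{mod} |C|^{-1} &= (2\pi)^{-\frac{1}{2} p}\, |C \Sigma C'|^{-\frac{1}{2}} \exp\left[ -\tfrac{1}{2} (y - C\mu)' (C \Sigma C')^{-1} (y - C\mu) \right] \\ &= n(y \mid C\mu, C \Sigma C'). \end{aligned}$$
∎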
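
The identity (6) can be spot-checked numerically for p = 2. In the sketch below $\mu$, $\Sigma$, the nonsingular matrix C, and the evaluation point y are arbitrary illustrative values, and the helper names are hypothetical; the density routine is the p = 2 case of $n(x \mid \mu, \Sigma)$ written out by hand.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def density(x, mu, S):
    """n(x | mu, Sigma) for p = 2, with the 2 x 2 inverse written explicitly."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    inv = [[S[1][1] / det, -S[0][1] / det], [-S[1][0] / det, S[0][0] / det]]
    d = [x[0] - mu[0], x[1] - mu[1]]
    q = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))


# Assumed illustrative parameters.
mu = [1.0, -2.0]
Sigma = [[2.0, 0.6], [0.6, 1.0]]
C = [[1.0, 2.0], [0.5, -1.0]]  # nonsingular: determinant is -2

det_C = C[0][0] * C[1][1] - C[0][1] * C[1][0]
C_inv = [[C[1][1] / det_C, -C[0][1] / det_C], [-C[1][0] / det_C, C[0][0] / det_C]]
C_mu = [row[0] * mu[0] + row[1] * mu[1] for row in C]
C_Sigma_Ct = matmul(matmul(C, Sigma), transpose(C))

y = [0.7, 1.3]
x = [C_inv[0][0] * y[0] + C_inv[0][1] * y[1],
     C_inv[1][0] * y[0] + C_inv[1][1] * y[1]]          # x = C^{-1} y, as in (2)

lhs = density(x, mu, Sigma) / abs(det_C)               # n(C^{-1}y | mu, Sigma) mod|C|^{-1}
rhs = density(y, C_mu, C_Sigma_Ct)                     # n(y | C mu, C Sigma C')
```

The two sides agree up to rounding; note that the absolute value of the determinant matters here, since this C has a negative determinant.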


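Now let us consider two sets of random variables $X_1, \ldots, X_q$ and 
$X_{q+1}, \ldots, X_p$ forming the vectors

(7)    $$X^{(1)} = \begin{pmatrix} X_1 \\ \vdots \\ X_q \end{pmatrix}, \qquad X^{(2)} = \begin{pmatrix} X_{q+1} \\ \vdots \\ X_p \end{pmatrix}.$$

These variables form the random vector

(8)    $$X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix} = \begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix}.$$

Now let us assume that the p variates have a joint normal distribution with 
mean vectors

(9)    $$\mathscr{E} X^{(1)} = \mu^{(1)}, \qquad \mathscr{E} X^{(2)} = \mu^{(2)},$$

and covariance matrices

(10)    $$\mathscr{E} (X^{(1)} - \mu^{(1)})(X^{(1)} - \mu^{(1)})' = \Sigma_{11},$$

(11)    $$\mathscr{E} (X^{(2)} - \mu^{(2)})(X^{(2)} - \mu^{(2)})' = \Sigma_{22},$$

(12)    $$\mathscr{E} (X^{(1)} - \mu^{(1)})(X^{(2)} - \mu^{(2)})' = \Sigma_{12}.$$

We say that the random vector X has been partitioned in (8) into subvectors, 
that

(13)    $$\mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix}$$

has been partitioned similarly into subvectors, and that

(14)    $$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$

has been partitioned similarly into submatrices. Here $\Sigma_{21} = \Sigma_{12}'$. (See Appendix, 
Section A.3.) 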

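We shall show that $X^{(1)}$ and $X^{(2)}$ are independently normally distributed 
if $\Sigma_{12} = \Sigma_{21}' = 0$. Then

(15)    $$\Sigma = \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}.$$

Its inverse is

(16)    $$\Sigma^{-1} = \begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix}.$$

Thus the quadratic form in the exponent of $n(x \mid \mu, \Sigma)$ is

(17)    $$\begin{aligned} Q &= (x - \mu)' \Sigma^{-1} (x - \mu) \\ &= \left[ (x^{(1)} - \mu^{(1)})',\ (x^{(2)} - \mu^{(2)})' \right] \begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix} \begin{pmatrix} x^{(1)} - \mu^{(1)} \\ x^{(2)} - \mu^{(2)} \end{pmatrix} \\ &= \left[ (x^{(1)} - \mu^{(1)})' \Sigma_{11}^{-1},\ (x^{(2)} - \mu^{(2)})' \Sigma_{22}^{-1} \right] \begin{pmatrix} x^{(1)} - \mu^{(1)} \\ x^{(2)} - \mu^{(2)} \end{pmatrix} \\ &= (x^{(1)} - \mu^{(1)})' \Sigma_{11}^{-1} (x^{(1)} - \mu^{(1)}) + (x^{(2)} - \mu^{(2)})' \Sigma_{22}^{-1} (x^{(2)} - \mu^{(2)}) \\ &= Q_1 + Q_2, \end{aligned}$$

say, where

(18)    $$\begin{aligned} Q_1 &= (x^{(1)} - \mu^{(1)})' \Sigma_{11}^{-1} (x^{(1)} - \mu^{(1)}), \\ Q_2 &= (x^{(2)} - \mu^{(2)})' \Sigma_{22}^{-1} (x^{(2)} - \mu^{(2)}). \end{aligned}$$

Also we note that $|\Sigma| = |\Sigma_{11}| \cdot |\Sigma_{22}|$. The density of X can be written

(19)    $$\begin{aligned} n(x \mid \mu, \Sigma) &= \frac{1}{(2\pi)^{\frac{1}{2} p} |\Sigma|^{\frac{1}{2}}}\, e^{-\frac{1}{2} Q} \\ &= \frac{1}{(2\pi)^{\frac{1}{2} q} |\Sigma_{11}|^{\frac{1}{2}}}\, e^{-\frac{1}{2} Q_1} \cdot \frac{1}{(2\pi)^{\frac{1}{2} (p - q)} |\Sigma_{22}|^{\frac{1}{2}}}\, e^{-\frac{1}{2} Q_2} \\ &= n(x^{(1)} \mid \mu^{(1)}, \Sigma_{11})\, n(x^{(2)} \mid \mu^{(2)}, \Sigma_{22}). \end{aligned}$$

The marginal density of $X^{(1)}$ is given by the integral

(20)    $$\begin{aligned} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} n(x \mid \mu, \Sigma)\, dx_{q+1} \cdots dx_p &= n(x^{(1)} \mid \mu^{(1)}, \Sigma_{11}) \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} n(x^{(2)} \mid \mu^{(2)}, \Sigma_{22})\, dx_{q+1} \cdots dx_p \\ &= n(x^{(1)} \mid \mu^{(1)}, \Sigma_{11}). \end{aligned}$$

Thus the marginal distribution of $X^{(1)}$ is $N(\mu^{(1)}, \Sigma_{11})$; similarly the marginal 
distribution of $X^{(2)}$ is $N(\mu^{(2)}, \Sigma_{22})$. Thus the joint density of $X_1, \ldots, X_p$ is the 
product of the marginal density of $X_1, \ldots, X_q$ and the marginal density of 
$X_{q+1}, \ldots, X_p$, and therefore the two sets of variates are independent. Since 
the numbering of variates can always be done so that $X^{(1)}$ consists of any 
subset of the variates, we have proved the sufficiency in the following 
theorem: 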


Theorem 2.4.2. If $X_1, \ldots, X_p$ have a joint normal distribution, a necessary 
and sufficient condition for one subset of the random variables and the subset 
consisting of the remaining variables to be independent is that each covariance of 
a variable from one set and a variable from the other set is 0. 




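The necessity follows from the fact that if $X_i$ is from one set and $X_j$ from 
the other, then for any density (see Section 2.2.3)

(21)    $$\begin{aligned} \sigma_{ij} &= \mathscr{E} (X_i - \mu_i)(X_j - \mu_j) \\ &= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} (x_i - \mu_i)(x_j - \mu_j) f(x_1, \ldots, x_q) f(x_{q+1}, \ldots, x_p)\, dx_1 \cdots dx_p \\ &= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} (x_i - \mu_i) f(x_1, \ldots, x_q)\, dx_1 \cdots dx_q \cdot \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} (x_j - \mu_j) f(x_{q+1}, \ldots, x_p)\, dx_{q+1} \cdots dx_p \\ &= 0. \end{aligned}$$

Since $\sigma_{ij} = \sigma_i \sigma_j \rho_{ij}$, and $\sigma_i, \sigma_j \ne 0$ (we tacitly assume that $\Sigma$ is nonsingular), 
the condition $\sigma_{ij} = 0$ is equivalent to $\rho_{ij} = 0$. Thus if one set of variates is 
uncorrelated with the remaining variates, the two sets are independent. It 
should be emphasized that the implication of independence by lack of 
correlation depends on the assumption of normality, but the converse is 
always true. 

Let us consider the special case of the bivariate normal distribution. Then 
$X^{(1)} = X_1$, $X^{(2)} = X_2$, $\mu^{(1)} = \mu_1$, $\mu^{(2)} = \mu_2$, $\Sigma_{11} = \sigma_{11} = \sigma_1^2$, $\Sigma_{22} = \sigma_{22} = \sigma_2^2$, 
and $\Sigma_{12} = \Sigma_{21} = \sigma_{12} = \sigma_1 \sigma_2 \rho_{12}$. Thus if $X_1$ and $X_2$ have a bivariate normal 
distribution, they are independent if and only if they are uncorrelated. If they 
are uncorrelated, the marginal distribution of $X_i$ is normal with mean $\mu_i$ and 
variance $\sigma_i^2$. The above discussion also proves the following corollary: 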


Corollary 2.4.1. If X is distributed according to $N(\mu, \Sigma)$ and if a set of 
components of X is uncorrelated with the other components, the marginal 
distribution of the set is multivariate normal with means, variances, and covariances 
obtained by taking the corresponding components of $\mu$ and $\Sigma$, respectively. 


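Now let us show that the corollary holds even if the two sets are not
independent. We partition $X$, $\mu$, and $\Sigma$ as before. We shall make a
nonsingular linear transformation to subvectors

(22) $$ Y^{(1)} = X^{(1)} + B X^{(2)}, $$
(23) $$ Y^{(2)} = X^{(2)}, $$

choosing $B$ so that the components of $Y^{(1)}$ are uncorrelated with the
components of $Y^{(2)} = X^{(2)}$. The matrix $B$ must satisfy the equation

(24)
$$
\begin{aligned}
0 &= \mathcal{E}(Y^{(1)} - \mathcal{E}Y^{(1)})(Y^{(2)} - \mathcal{E}Y^{(2)})' \\
&= \mathcal{E}(X^{(1)} + BX^{(2)} - \mathcal{E}X^{(1)} - B\,\mathcal{E}X^{(2)})(X^{(2)} - \mathcal{E}X^{(2)})' \\
&= \mathcal{E}\big[(X^{(1)} - \mathcal{E}X^{(1)}) + B(X^{(2)} - \mathcal{E}X^{(2)})\big](X^{(2)} - \mathcal{E}X^{(2)})' \\
&= \Sigma_{12} + B\Sigma_{22}.
\end{aligned}
$$

Thus $B = -\Sigma_{12}\Sigma_{22}^{-1}$ and

(25) $$ Y^{(1)} = X^{(1)} - \Sigma_{12}\Sigma_{22}^{-1} X^{(2)}. $$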


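The vector

(26) $$ \begin{pmatrix} Y^{(1)} \\ Y^{(2)} \end{pmatrix} = Y = \begin{pmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{pmatrix} X $$

is a nonsingular transform of $X$, and therefore has a normal distribution with

(27)
$$
\begin{aligned}
\mathcal{E}\begin{pmatrix} Y^{(1)} \\ Y^{(2)} \end{pmatrix}
&= \mathcal{E}\begin{pmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{pmatrix} X
= \begin{pmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix} \\
&= \begin{pmatrix} \mu^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}\mu^{(2)} \\ \mu^{(2)} \end{pmatrix}
= \begin{pmatrix} \nu^{(1)} \\ \nu^{(2)} \end{pmatrix} = \nu,
\end{aligned}
$$

say, and

(28)
$$
\begin{aligned}
\mathcal{C}(Y) &= \mathcal{E}(Y - \nu)(Y - \nu)' \\
&= \begin{pmatrix}
\mathcal{E}(Y^{(1)} - \nu^{(1)})(Y^{(1)} - \nu^{(1)})' & \mathcal{E}(Y^{(1)} - \nu^{(1)})(Y^{(2)} - \nu^{(2)})' \\
\mathcal{E}(Y^{(2)} - \nu^{(2)})(Y^{(1)} - \nu^{(1)})' & \mathcal{E}(Y^{(2)} - \nu^{(2)})(Y^{(2)} - \nu^{(2)})'
\end{pmatrix} \\
&= \begin{pmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}
\end{aligned}
$$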


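since

(29)
$$
\begin{aligned}
\mathcal{E}(Y^{(1)} - \nu^{(1)})(Y^{(1)} - \nu^{(1)})'
&= \mathcal{E}\big[(X^{(1)} - \mu^{(1)}) - \Sigma_{12}\Sigma_{22}^{-1}(X^{(2)} - \mu^{(2)})\big]
   \big[(X^{(1)} - \mu^{(1)}) - \Sigma_{12}\Sigma_{22}^{-1}(X^{(2)} - \mu^{(2)})\big]' \\
&= \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} + \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22}\Sigma_{22}^{-1}\Sigma_{21} \\
&= \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.
\end{aligned}
$$

Thus $Y^{(1)}$ and $Y^{(2)}$ are independent, and by Corollary 2.4.1 $X^{(2)} = Y^{(2)}$ has
the marginal distribution $N(\mu^{(2)}, \Sigma_{22})$. Because the numbering of the
components of $X$ is arbitrary, we can state the following theorem: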


Theorem 2.4.3. If $X$ is distributed according to $N(\mu, \Sigma)$, the marginal
distribution of any set of components of $X$ is multivariate normal with means,
variances, and covariances obtained by taking the corresponding components of
$\mu$ and $\Sigma$, respectively.
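The block-triangular transformation (26) can be checked numerically. The following is a minimal NumPy sketch, not part of the original text: the function name `decorrelate_blocks` and the 3 x 3 covariance matrix are illustrative assumptions. It builds the matrix in (26) and computes the covariance of $Y$, which should be block diagonal as in (28).

```python
import numpy as np

def decorrelate_blocks(sigma, q):
    """Return M = [[I, -S12 S22^{-1}], [0, I]] and the covariance M sigma M'."""
    sigma = np.asarray(sigma, dtype=float)
    p = sigma.shape[0]
    s12 = sigma[:q, q:]
    s22 = sigma[q:, q:]
    # B = -Sigma_12 Sigma_22^{-1}, computed with a linear solve rather than an explicit inverse.
    b = -np.linalg.solve(s22.T, s12.T).T
    m = np.eye(p)
    m[:q, q:] = b
    return m, m @ sigma @ m.T

# Hypothetical positive definite covariance matrix, partitioned with q = 1.
sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 3.0, 0.5],
                  [0.8, 0.5, 2.0]])
m, cov_y = decorrelate_blocks(sigma, q=1)

# Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21, the upper-left block in (28).
schur = sigma[:1, :1] - sigma[:1, 1:] @ np.linalg.solve(sigma[1:, 1:], sigma[1:, :1])
print(cov_y)
```

The off-diagonal blocks of `cov_y` vanish up to rounding, and the lower-right block is $\Sigma_{22}$ unchanged, which is the content of Theorem 2.4.3 for the second subvector.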


Now consider any transformation

(30) $$ Z = DX, $$

where $Z$ has $q$ components and $D$ is a $q \times p$ real matrix. The expected value
of $Z$ is

(31) $$ \mathcal{E}Z = D\mu, $$

and the covariance matrix is

(32) $$ \mathcal{E}(Z - D\mu)(Z - D\mu)' = D\Sigma D'. $$

The case $q = p$ and $D$ nonsingular has been treated above. If $q \le p$ and $D$ is
of rank $q$, we can find a $(p - q) \times p$ matrix $E$ such that

(33) $$ \begin{pmatrix} Z \\ W \end{pmatrix} = \begin{pmatrix} D \\ E \end{pmatrix} X $$

is a nonsingular transformation. (See Appendix, Section A.3.) Then $Z$ and $W$
have a joint normal distribution, and $Z$ has a marginal normal distribution by
Theorem 2.4.3. Thus for $D$ of rank $q$ (and $X$ having a nonsingular distribution,
that is, a density) we have proved the following theorem:




Theorem 2.4.4. If $X$ is distributed according to $N(\mu, \Sigma)$, then $Z = DX$ is
distributed according to $N(D\mu, D\Sigma D')$, where $D$ is a $q \times p$ matrix of rank $q \le p$.
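As an illustration of (31) and (32), the sketch below computes the parameters of $Z = DX$ for two choices of $D$. It is an assumption-laden example rather than anything from the text: `transform_params`, the mean vector, and the covariance matrix are hypothetical. Taking $D$ to be a selection matrix recovers the marginal distribution of Theorem 2.4.3 as a special case.

```python
import numpy as np

def transform_params(mu, sigma, d):
    """Mean D mu and covariance D sigma D' of Z = D X when X ~ N(mu, sigma)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = np.atleast_2d(np.asarray(d, dtype=float))
    return d @ mu, d @ sigma @ d.T

# Hypothetical parameters of a trivariate normal distribution.
mu = np.array([1.0, 2.0, 3.0])
sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 3.0, 0.5],
                  [0.8, 0.5, 2.0]])

# A selection matrix of rank 2 picks out the first and third components.
select = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
marg_mean, marg_cov = transform_params(mu, sigma, select)

# A single linear combination X_1 - X_2 (q = 1).
diff_mean, diff_var = transform_params(mu, sigma, [[1.0, -1.0, 0.0]])
print(marg_mean, marg_cov, diff_mean, diff_var)
```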


The remainder of this section is devoted to the singular or degenerate 
normal distribution and the extension of Theorem 2.4.4 to the case of any 
matrix D. A singular distribution is a distribution in p-space that is concen- 
trated on a lower dimensional set; that is, the probability associated with any 
set not intersecting the given set is 0. In the case of the singular normal 
distribution the mass is concentrated on a given linear set [that is, the 
intersection of a number of ( p — 1)-dimensional hyperplanes]. Let y be a set 
of coordinates in the linear set (the number of coordinates equaling the 
dimensionality of the linear set); then the parametric definition of the linear 
set can be given as $x = Ay + \lambda$, where $A$ is a $p \times q$ matrix and $\lambda$ is a
$p$-vector. Suppose that $Y$ is normally distributed in the $q$-dimensional linear
set; then we say that

(34) $$ X = AY + \lambda $$

has a singular or degenerate normal distribution in $p$-space. If $\mathcal{E}Y = \nu$, then
$\mathcal{E}X = A\nu + \lambda = \mu$, say. If $\mathcal{E}(Y - \nu)(Y - \nu)' = T$, then

(35) $$ \mathcal{E}(X - \mu)(X - \mu)' = \mathcal{E}A(Y - \nu)(Y - \nu)'A' = ATA' = \Sigma, $$

say. It should be noticed that if $p > q$, then $\Sigma$ is singular and therefore has
no inverse, and thus we cannot write the normal density for $X$. In fact, $X$
cannot have a density at all, because the fact that the probability of any set
not intersecting the $q$-set is 0 would imply that the density is 0 almost
everywhere.

Now, conversely, let us see that if $X$ has mean $\mu$ and covariance matrix $\Sigma$
of rank $r$, it can be written as (34) (except for 0 probabilities), where $X$ has
an arbitrary distribution, and $Y$ of $r$ ($\le p$) components has a suitable
distribution. If $\Sigma$ is of rank $r$, there is a $p \times p$ nonsingular matrix $B$ such
that

(36) $$ B\Sigma B' = \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix}, $$

where the identity is of order $r$. (See Theorem A.4.1 of the Appendix.) The
transformation

(37) $$ BX = V = \begin{pmatrix} V^{(1)} \\ V^{(2)} \end{pmatrix} $$




defines a random vector $V$ with covariance matrix (36) and a mean vector

(38) $$ \mathcal{E}V = B\mu = \nu = \begin{pmatrix} \nu^{(1)} \\ \nu^{(2)} \end{pmatrix}, $$

say. Since the variances of the elements of $V^{(2)}$ are zero, $V^{(2)} = \nu^{(2)}$ with
probability 1. Now partition

(39) $$ B^{-1} = (C \;\; D), $$

where $C$ consists of $r$ columns. Then (37) is equivalent to

(40) $$ X = B^{-1}V = (C \;\; D)\begin{pmatrix} V^{(1)} \\ V^{(2)} \end{pmatrix} = CV^{(1)} + DV^{(2)}. $$

Thus with probability 1

(41) $$ X = CV^{(1)} + D\nu^{(2)}, $$

which is of the form of (34) with $C$ as $A$, $V^{(1)}$ as $Y$, and $D\nu^{(2)}$ as $\lambda$.

Now we give a formal definition of a normal distribution that includes the
singular distribution.


Definition 2.4.1. A random vector $X$ of $p$ components with $\mathcal{E}X = \mu$ and
$\mathcal{E}(X - \mu)(X - \mu)' = \Sigma$ is said to be normally distributed [or is said to be
distributed according to $N(\mu, \Sigma)$] if there is a transformation (34), where the
number of rows of $A$ is $p$ and the number of columns is the rank of $\Sigma$, say $r$, and
$Y$ (of $r$ components) has a nonsingular normal distribution, that is, has a density

(42) $$ k\, e^{-\frac{1}{2}(y - \nu)'T^{-1}(y - \nu)}. $$

It is clear that if $\Sigma$ has rank $p$, then $A$ can be taken to be $I$ and $\lambda$ to be
0; then $X = Y$ and Definition 2.4.1 agrees with Section 2.3. To avoid
redundancy in Definition 2.4.1 we could take $T = I$ and $\nu = 0$.
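The representation (34) with $T = I$ and $\nu = 0$ suggests a direct way to simulate a possibly singular normal vector. The sketch below is one such construction under stated assumptions: it obtains $A$ from an eigendecomposition of $\Sigma$ (a choice made here for convenience, since the text only asserts that some $A$ exists), and the names `normal_factor`, `sample_normal`, and the rank-2 example are hypothetical.

```python
import numpy as np

def normal_factor(sigma, tol=1e-10):
    """Return A (p x r) with A A' = sigma, where r is the numerical rank of sigma."""
    sigma = np.asarray(sigma, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(sigma)
    keep = eigvals > tol * max(eigvals.max(), 1.0)   # drop (numerically) zero eigenvalues
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])

def sample_normal(mu, sigma, n, rng):
    """Draw n rows of X = A Y + mu with Y ~ N(0, I_r); works for singular sigma."""
    a = normal_factor(sigma)
    y = rng.standard_normal((n, a.shape[1]))
    return np.asarray(mu, dtype=float) + y @ a.T

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0, 0.5])
# Rank-2 covariance in 3-space: the third coordinate is the sum of the first two.
load = np.array([[1.0, 0.0],
                 [0.0, 2.0],
                 [1.0, 2.0]])
sigma = load @ load.T
a = normal_factor(sigma)
x = sample_normal(mu, sigma, 1000, rng)

# c'(X - mu) has variance c' sigma c = 0, so every draw lies in the plane c'(x - mu) = 0.
c = np.array([1.0, 1.0, -1.0])
print(a.shape, np.max(np.abs((x - mu) @ c)))
```

All simulated points lie in a two-dimensional linear set, which is why no density in 3-space exists for this $X$.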


Theorem 2.4.5. If $X$ is distributed according to $N(\mu, \Sigma)$, then $Z = DX$ is
distributed according to $N(D\mu, D\Sigma D')$.


This theorem includes the cases where $X$ may have a nonsingular or a
singular distribution and $D$ may be nonsingular or of rank less than $q$. Since
$X$ can be represented by (34), where $Y$ has a nonsingular distribution
$N(\nu, T)$, we can write

(43) $$ Z = DAY + D\lambda, $$

where $DA$ is $q \times r$. If the rank of $DA$ is $r$, the theorem is proved. If the rank
is less than $r$, say $s$, then the covariance matrix of $Z$,

(44) $$ DATA'D' = E, $$

say, is of rank $s$. By Theorem A.4.1 of the Appendix, there is a nonsingular
matrix

(45) $$ F = \begin{pmatrix} F_1 \\ F_2 \end{pmatrix} $$

such that

(46)
$$
\begin{aligned}
FEF' &= \begin{pmatrix} F_1EF_1' & F_1EF_2' \\ F_2EF_1' & F_2EF_2' \end{pmatrix} \\
&= \begin{pmatrix} (F_1DA)T(F_1DA)' & (F_1DA)T(F_2DA)' \\ (F_2DA)T(F_1DA)' & (F_2DA)T(F_2DA)' \end{pmatrix}
= \begin{pmatrix} I_s & 0 \\ 0 & 0 \end{pmatrix}.
\end{aligned}
$$

Thus $F_1DA$ is of rank $s$ (by the converse of Theorem A.1.1 of the Appendix),
and $F_2DA = 0$ because each diagonal element of $(F_2DA)T(F_2DA)'$ is a
quadratic form in a row of $F_2DA$ with positive definite matrix $T$. Thus the
covariance matrix of $FZ$ is (46), and

(47) $$ FZ = \begin{pmatrix} F_1 \\ F_2 \end{pmatrix} DAY + FD\lambda = \begin{pmatrix} F_1DAY \\ 0 \end{pmatrix} + FD\lambda = \begin{pmatrix} U_1 \\ 0 \end{pmatrix} + FD\lambda, $$

say. Clearly $U_1$ has a nonsingular normal distribution. Let $F^{-1} = (G_1 \;\; G_2)$.
Then

(48) $$ Z = G_1U_1 + D\lambda, $$

which is of the form (34). ∎


The developments in this section can be illuminated by considering the 
geometric interpretation put forward in the previous section. The density of 
X is constant on the ellipsoids (54) of Section 2.3. Since the transformation 
(2) is a linear transformation (i.e., a change of coordinate axes), the density of
$Y$ is constant on ellipsoids

(49) $$ (y - C\mu)'(C\Sigma C')^{-1}(y - C\mu) = k. $$

The marginal distribution of $X^{(1)}$ is the projection of the mass of the
distribution of $X$ onto the $q$-dimensional space of the first $q$ coordinate axes.
The surfaces of constant density are again ellipsoids. The projection of mass
on any line is normal.


2.5. CONDITIONAL DISTRIBUTIONS AND MULTIPLE 
CORRELATION COEFFICIENT 


2.5.1. Conditional Distributions 


In this section we find that conditional distributions derived from a joint
normal distribution are normal. The conditional distributions are of a
particularly simple nature because the means depend only linearly on the variates
held fixed, and the variances and covariances do not depend at all on the
values of the fixed variates. The theory of partial and multiple correlation
discussed in this section was originally developed by Karl Pearson (1896) for
three variables and extended by Yule (1897a, 1897b).

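Let $X$ be distributed according to $N(\mu, \Sigma)$ (with $\Sigma$ nonsingular). Let us
partition

(1) $$ X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix} $$

as before into $q$- and $(p - q)$-component subvectors, respectively. We shall
use the algebra developed in Section 2.4 here. The joint density of
$Y^{(1)} = X^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}X^{(2)}$ and $Y^{(2)} = X^{(2)}$ is

$$ n\big(y^{(1)} \mid \mu^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}\mu^{(2)},\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\big)\, n\big(y^{(2)} \mid \mu^{(2)}, \Sigma_{22}\big). $$

The density of $X^{(1)}$ and $X^{(2)}$ then can be obtained from this expression by
substituting $x^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}x^{(2)}$ for $y^{(1)}$ and $x^{(2)}$ for $y^{(2)}$ (the Jacobian of this
transformation being 1); the resulting density of $X^{(1)}$ and $X^{(2)}$ is

(2)
$$
\begin{aligned}
f(x^{(1)}, x^{(2)}) = {} & \frac{1}{(2\pi)^{q/2}\sqrt{|\Sigma_{11\cdot 2}|}}
\exp\Big\{ -\tfrac{1}{2}\big[(x^{(1)} - \mu^{(1)}) - \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})\big]' \\
& \qquad \cdot\ \Sigma_{11\cdot 2}^{-1}\big[(x^{(1)} - \mu^{(1)}) - \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})\big] \Big\} \\
& \cdot\ \frac{1}{(2\pi)^{(p-q)/2}\sqrt{|\Sigma_{22}|}}
\exp\Big[ -\tfrac{1}{2}(x^{(2)} - \mu^{(2)})'\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)}) \Big],
\end{aligned}
$$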




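where

(3) $$ \Sigma_{11\cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}. $$

This density must be $n(x \mid \mu, \Sigma)$. The conditional density of $X^{(1)}$ given that
$X^{(2)} = x^{(2)}$ is the quotient of (2) and the marginal density of $X^{(2)}$ at the point
$x^{(2)}$, which is $n(x^{(2)} \mid \mu^{(2)}, \Sigma_{22})$, the second factor of (2). The quotient is

(4)
$$
\begin{aligned}
f(x^{(1)} \mid x^{(2)}) = \frac{1}{(2\pi)^{q/2}\sqrt{|\Sigma_{11\cdot 2}|}}
\exp\Big\{ & -\tfrac{1}{2}\big[(x^{(1)} - \mu^{(1)}) - \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})\big]' \\
& \cdot\ \Sigma_{11\cdot 2}^{-1}\big[(x^{(1)} - \mu^{(1)}) - \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})\big] \Big\}.
\end{aligned}
$$

It is understood that $x^{(2)}$ consists of $p - q$ numbers. The density $f(x^{(1)} \mid x^{(2)})$
is a $q$-variate normal density with mean

(5) $$ \mathcal{E}(X^{(1)} \mid x^{(2)}) = \mu^{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)}) = \nu(x^{(2)}), $$

say, and covariance matrix

(6) $$ \mathcal{E}\big\{[X^{(1)} - \nu(x^{(2)})][X^{(1)} - \nu(x^{(2)})]' \mid x^{(2)}\big\} = \Sigma_{11\cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}. $$

It should be noted that the mean of $X^{(1)}$ given $x^{(2)}$ is simply a linear function
of $x^{(2)}$, and the covariance matrix of $X^{(1)}$ given $x^{(2)}$ does not depend on $x^{(2)}$
at all.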


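Definition 2.5.1. The matrix $B = \Sigma_{12}\Sigma_{22}^{-1}$ is the matrix of regression
coefficients of $X^{(1)}$ on $x^{(2)}$.

The element in the $i$th row and $(k - q)$th column of $B = \Sigma_{12}\Sigma_{22}^{-1}$ is often
denoted by

(7) $$ \beta_{ik\cdot q+1,\ldots,k-1,k+1,\ldots,p}, \qquad i = 1, \ldots, q, \quad k = q+1, \ldots, p. $$

The vector $\mu^{(1)} + B(x^{(2)} - \mu^{(2)})$ is called the regression function.
Let $\sigma_{ij\cdot q+1,\ldots,p}$ be the $i,j$th element of $\Sigma_{11\cdot 2}$. We call these partial
covariances; $\sigma_{ii\cdot q+1,\ldots,p}$ is a partial variance.

Definition 2.5.2.

(8) $$ \rho_{ij\cdot q+1,\ldots,p} = \frac{\sigma_{ij\cdot q+1,\ldots,p}}{\sqrt{\sigma_{ii\cdot q+1,\ldots,p}}\,\sqrt{\sigma_{jj\cdot q+1,\ldots,p}}}, \qquad i, j = 1, \ldots, q, $$

is the partial correlation between $X_i$ and $X_j$ holding $X_{q+1}, \ldots, X_p$ fixed.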




The numbering of the components of $X$ is arbitrary and $q$ is arbitrary.
Hence, the above serves to define the conditional distribution of any $q$
components of $X$ given any other $p - q$ components. In the case of partial
covariances and correlations the conditioning variables are indicated by the
subscripts after the dot, and in the case of regression coefficients the
dependent variable is indicated by the first subscript, the relevant
conditioning variable by the second subscript, and the other conditioning variables by
the subscripts after the dot. Further, the notation accommodates the
conditional distribution of any $q$ variables conditional on any other $r - q$ variables
($q \le r \le p$).


Theorem 2.5.1. Let the components of $X$ be divided into two groups
composing the subvectors $X^{(1)}$ and $X^{(2)}$. Suppose the mean $\mu$ is similarly divided into
$\mu^{(1)}$ and $\mu^{(2)}$, and suppose the covariance matrix $\Sigma$ of $X$ is divided into
$\Sigma_{11}$, $\Sigma_{12}$, $\Sigma_{22}$, the covariance matrices of $X^{(1)}$, of $X^{(1)}$ and $X^{(2)}$, and of $X^{(2)}$,
respectively. Then if the distribution of $X$ is normal, the conditional distribution of
$X^{(1)}$ given $X^{(2)} = x^{(2)}$ is normal with mean $\mu^{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})$ and
covariance matrix $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$.


As an example of the above considerations let us consider the bivariate
normal distribution and find the conditional distribution of $X_1$ given $X_2 = x_2$.
In this case $\mu^{(1)} = \mu_1$, $\mu^{(2)} = \mu_2$, $\Sigma_{11} = \sigma_1^2$, $\Sigma_{12} = \sigma_1\sigma_2\rho$, and $\Sigma_{22} = \sigma_2^2$. Thus
the $1 \times 1$ matrix of regression coefficients is $\Sigma_{12}\Sigma_{22}^{-1} = \sigma_1\rho/\sigma_2$, and the
$1 \times 1$ matrix of partial covariances is

(9) $$ \Sigma_{11\cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \sigma_1^2 - \sigma_1^2\sigma_2^2\rho^2/\sigma_2^2 = \sigma_1^2(1 - \rho^2). $$

The density of $X_1$ given $x_2$ is $n[x_1 \mid \mu_1 + (\sigma_1\rho/\sigma_2)(x_2 - \mu_2),\ \sigma_1^2(1 - \rho^2)]$.
The mean of this conditional distribution increases with $x_2$ when $\rho$ is
positive and decreases with increasing $x_2$ when $\rho$ is negative. It may be
noted that when $\sigma_1 = \sigma_2$, for example, the mean of the conditional
distribution of $X_1$ does not increase relative to $\mu_1$ as much as $x_2$ increases relative to
$\mu_2$. [Galton (1889) observed that the average heights of sons whose fathers'
heights were above average tended to be less than the fathers' heights; he
called this effect "regression towards mediocrity."] The larger $|\rho|$ is, the
smaller the variance of the conditional distribution, that is, the more
information $x_2$ gives about $x_1$. This is another reason for considering $\rho$ a
measure of association between $X_1$ and $X_2$.
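The formulas of Theorem 2.5.1 translate directly into a few lines of linear algebra. The following sketch is illustrative only: the function name `conditional_normal` and the numerical values ($\sigma_1 = 2$, $\sigma_2 = 1$, $\rho = 0.5$, $\mu = (0, 1)'$, $x_2 = 3$) are assumptions chosen to mirror the bivariate example.

```python
import numpy as np

def conditional_normal(mu, sigma, q, x2):
    """Mean, covariance, and regression matrix of X^(1) given X^(2) = x2.

    X^(1) is the first q components of X ~ N(mu, sigma), with Sigma_22 nonsingular.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    mu1, mu2 = mu[:q], mu[q:]
    s11, s12, s22 = sigma[:q, :q], sigma[:q, q:], sigma[q:, q:]
    beta = np.linalg.solve(s22, s12.T).T          # B = Sigma_12 Sigma_22^{-1}
    cond_mean = mu1 + beta @ (np.asarray(x2, dtype=float) - mu2)
    cond_cov = s11 - beta @ s12.T                 # Sigma_11.2, free of x2
    return cond_mean, cond_cov, beta

# Bivariate case: sigma_1 = 2, sigma_2 = 1, rho = 0.5.
sigma = np.array([[4.0, 1.0],
                  [1.0, 1.0]])
cond_mean, cond_cov, beta = conditional_normal([0.0, 1.0], sigma, q=1, x2=[3.0])
print(cond_mean, cond_cov, beta)
```

Here the regression coefficient is $\sigma_1\rho/\sigma_2 = 1$, the conditional mean is $0 + 1 \cdot (3 - 1) = 2$, and the conditional variance is $\sigma_1^2(1 - \rho^2) = 3$, in agreement with (9).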

A geometrical interpretation of the theory is enlightening. The density
$f(x_1, x_2)$ can be thought of as a surface $z = f(x_1, x_2)$ over the $x_1, x_2$-plane. If
we intersect this surface with the plane $x_2 = c$, we obtain a curve $z = f(x_1, c)$
over the line $x_2 = c$ in the $x_1, x_2$-plane. The ordinate of this curve is
proportional to the conditional density of $X_1$ given $x_2 = c$; that is, it is
proportional to the ordinate of the curve of a univariate normal distribution.
In the more general case it is convenient to consider the ellipsoids of
constant density in the $p$-dimensional space. Then the surfaces of constant
density of $f(x_1, \ldots, x_q \mid c_{q+1}, \ldots, c_p)$ are the intersections of the surfaces of
constant density of $f(x_1, \ldots, x_p)$ and the hyperplanes $x_{q+1} = c_{q+1}, \ldots, x_p = c_p$;
these are again ellipsoids.

Further clarification of these ideas may be had by consideration of an 
actual population which is idealized by a normal distribution. Consider, for 
example, a population of father-son pairs. If the population is reasonably 
homogeneous, the heights of fathers and the heights of corresponding sons 
have approximately a normal distribution (over a certain range). A
conditional distribution may be obtained by considering sons of all fathers whose
height is, say, 5 feet, 9 inches (to the accuracy of measurement); the heights
of these sons will have an approximate univariate normal distribution. The 
mean of this normal distribution will differ from the mean of the heights of 
sons whose fathers’ heights are 5 feet, 4 inches, say, but the variances will be 
about the same. 

We could also consider triplets of observations, the height of a father, 
height of the oldest son, and height of the next oldest son. The collection of 
heights of two sons given that the fathers' heights are 5 feet, 9 inches is a
conditional distribution of two variables; the correlation between the heights
of oldest and next oldest sons is a partial correlation coefficient. Holding the
fathers’ heights constant eliminates the effect of heredity from fathers; 
however, one would expect that the partial correlation coefficient would be 
positive, since the effect of mothers’ heredity and environmental factors 
would tend to cause brothers’ heights to vary similarly. 

As we have remarked above, any conditional distribution obtained from a 
normal distribution is normal with the mean a linear function of the variables 
held fixed and the covariance matrix constant. In the case of nonnormal 
distributions the conditional distribution of one set of variates on another 
does not usually have these properties. However, one can construct
nonnormal distributions such that some conditional distributions have these
properties. This can be done by taking as the density of $X$ the product
$n[x^{(1)} \mid \mu^{(1)} + B(x^{(2)} - \mu^{(2)}), \Sigma_{11\cdot 2}]\, f(x^{(2)})$, where $f(x^{(2)})$ is an arbitrary density.


2.5.2. The Multiple Correlation Coefficient 


We again consider $X$ partitioned into $X^{(1)}$ and $X^{(2)}$. We shall study some
properties of $BX^{(2)}$.




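Definition 2.5.3. The vector $X^{(1\cdot 2)} = X^{(1)} - \mu^{(1)} - B(X^{(2)} - \mu^{(2)})$ is the
vector of residuals of $X^{(1)}$ from its regression on $X^{(2)}$.


Theorem 2.5.2. The components of $X^{(1\cdot 2)}$ are uncorrelated with the
components of $X^{(2)}$.


Proof. The vector $X^{(1\cdot 2)}$ is $Y^{(1)} - \mathcal{E}Y^{(1)}$ in (25) of Section 2.4. ∎


Let $\sigma_{(i)}'$ be the $i$th row of $\Sigma_{12}$, and $\beta_{(i)}'$ the $i$th row of $B$ (i.e., $\beta_{(i)}' = \sigma_{(i)}'\Sigma_{22}^{-1}$).
Let $\mathcal{V}(Z)$ be the variance of $Z$.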


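Theorem 2.5.3. For every vector $\alpha$

(10) $$ \mathcal{V}(X_i^{(1\cdot 2)}) \le \mathcal{V}(X_i - \alpha'X^{(2)}). $$

Proof. By Theorem 2.5.2

(11)
$$
\begin{aligned}
\mathcal{V}(X_i - \alpha'X^{(2)})
&= \mathcal{E}\big[X_i - \mu_i - \alpha'(X^{(2)} - \mu^{(2)})\big]^2 \\
&= \mathcal{E}\big[X_i^{(1\cdot 2)} - \mathcal{E}X_i^{(1\cdot 2)} + (\beta_{(i)} - \alpha)'(X^{(2)} - \mu^{(2)})\big]^2 \\
&= \mathcal{V}\big[X_i^{(1\cdot 2)}\big] + (\beta_{(i)} - \alpha)'\,\mathcal{E}(X^{(2)} - \mu^{(2)})(X^{(2)} - \mu^{(2)})'\,(\beta_{(i)} - \alpha) \\
&= \mathcal{V}\big(X_i^{(1\cdot 2)}\big) + (\beta_{(i)} - \alpha)'\Sigma_{22}(\beta_{(i)} - \alpha).
\end{aligned}
$$

Since $\Sigma_{22}$ is positive definite, the quadratic form in $\beta_{(i)} - \alpha$ is nonnegative
and attains its minimum of 0 at $\alpha = \beta_{(i)}$. ∎


Since $\mathcal{E}X^{(1\cdot 2)} = 0$, $\mathcal{V}(X_i^{(1\cdot 2)}) = \mathcal{E}(X_i^{(1\cdot 2)})^2$. Thus $\mu_i + \beta_{(i)}'(X^{(2)} - \mu^{(2)})$ is the
best linear predictor of $X_i$ in the sense that of all functions of $X^{(2)}$ of the form
$\alpha'X^{(2)} + c$, the mean squared error of the above is a minimum.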


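Theorem 2.5.4. For every vector $\alpha$

(12) $$ \operatorname{Corr}(X_i, \beta_{(i)}'X^{(2)}) \ge \operatorname{Corr}(X_i, \alpha'X^{(2)}). $$

Proof. Since the correlation between two variables is unchanged when
either or both is multiplied by a positive constant, we can assume that
$\mathcal{E}[\alpha'(X^{(2)} - \mu^{(2)})]^2 = \mathcal{E}[\beta_{(i)}'(X^{(2)} - \mu^{(2)})]^2$. Then the expansion of (10) is

(13)
$$
\begin{aligned}
& \sigma_{ii} - 2\,\mathcal{E}(X_i - \mu_i)\beta_{(i)}'(X^{(2)} - \mu^{(2)}) + \mathcal{V}(\beta_{(i)}'X^{(2)}) \\
& \qquad \le \sigma_{ii} - 2\,\mathcal{E}(X_i - \mu_i)\alpha'(X^{(2)} - \mu^{(2)}) + \mathcal{V}(\alpha'X^{(2)}).
\end{aligned}
$$

This leads to

(14) $$ \frac{\mathcal{E}(X_i - \mu_i)\beta_{(i)}'(X^{(2)} - \mu^{(2)})}{\sqrt{\sigma_{ii}}\,\sqrt{\mathcal{V}(\beta_{(i)}'X^{(2)})}} \ge \frac{\mathcal{E}(X_i - \mu_i)\alpha'(X^{(2)} - \mu^{(2)})}{\sqrt{\sigma_{ii}}\,\sqrt{\mathcal{V}(\alpha'X^{(2)})}}. $$

∎


Definition 2.5.4. The maximum correlation between $X_i$ and the linear
combination $\alpha'X^{(2)}$ is called the multiple correlation coefficient between $X_i$ and $X^{(2)}$.


It follows that this is

(15)
$$
\begin{aligned}
\bar{R}_{i\cdot q+1,\ldots,p}
&= \frac{\mathcal{E}\,\beta_{(i)}'(X^{(2)} - \mu^{(2)})(X_i - \mu_i)}{\sqrt{\sigma_{ii}}\,\sqrt{\mathcal{E}\,\beta_{(i)}'(X^{(2)} - \mu^{(2)})(X^{(2)} - \mu^{(2)})'\beta_{(i)}}} \\
&= \frac{\sigma_{(i)}'\Sigma_{22}^{-1}\sigma_{(i)}}{\sqrt{\sigma_{ii}}\,\sqrt{\sigma_{(i)}'\Sigma_{22}^{-1}\sigma_{(i)}}}
= \frac{\sqrt{\sigma_{(i)}'\Sigma_{22}^{-1}\sigma_{(i)}}}{\sqrt{\sigma_{ii}}}.
\end{aligned}
$$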


A useful formula is

(16) $$ 1 - \bar{R}^2_{i\cdot q+1,\ldots,p} = \frac{\sigma_{ii} - \sigma_{(i)}'\Sigma_{22}^{-1}\sigma_{(i)}}{\sigma_{ii}} = \frac{|\Sigma_i|}{\sigma_{ii}\,|\Sigma_{22}|}, $$

where Theorem A.3.2 of the Appendix has been applied to

(17) $$ \Sigma_i = \begin{pmatrix} \sigma_{ii} & \sigma_{(i)}' \\ \sigma_{(i)} & \Sigma_{22} \end{pmatrix}. $$

Since

(18) $$ \sigma_{ii\cdot q+1,\ldots,p} = \sigma_{ii} - \sigma_{(i)}'\Sigma_{22}^{-1}\sigma_{(i)}, $$

it follows that

(19) $$ \sigma_{ii\cdot q+1,\ldots,p} = \big(1 - \bar{R}^2_{i\cdot q+1,\ldots,p}\big)\,\sigma_{ii}. $$

This shows incidentally that any partial variance of a component of $X$ cannot
be greater than the variance. In fact, the larger $\bar{R}_{i\cdot q+1,\ldots,p}$ is, the greater the
reduction in variance on going to the conditional distribution. This fact is
another reason for considering the multiple correlation coefficient a measure
of association between $X_i$ and $X^{(2)}$.
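Formulas (15), (16), and (19) can be verified on a small example. The sketch below is an illustration under assumptions: the helper names `multiple_correlation` and `corr_with_combination` and the covariance matrix are hypothetical, and component indices are zero-based.

```python
import numpy as np

def multiple_correlation(sigma, i, given):
    """Multiple correlation between X_i and the components listed in `given`."""
    sigma = np.asarray(sigma, dtype=float)
    s_i = sigma[given, i]                       # sigma_(i)
    s22 = sigma[np.ix_(given, given)]           # Sigma_22
    beta = np.linalg.solve(s22, s_i)            # beta_(i) = Sigma_22^{-1} sigma_(i)
    return np.sqrt(s_i @ beta / sigma[i, i]), beta

def corr_with_combination(sigma, i, given, alpha):
    """Correlation between X_i and the linear combination alpha' X^(2)."""
    sigma = np.asarray(sigma, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    cov = sigma[given, i] @ alpha
    var = alpha @ sigma[np.ix_(given, given)] @ alpha
    return cov / np.sqrt(sigma[i, i] * var)

sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 3.0, 0.5],
                  [0.8, 0.5, 2.0]])
given = [1, 2]
r, beta = multiple_correlation(sigma, 0, given)

# Right-hand side of (16): |Sigma_i| / (sigma_ii |Sigma_22|); here Sigma_i is all of sigma.
det_ratio = np.linalg.det(sigma) / (sigma[0, 0] * np.linalg.det(sigma[np.ix_(given, given)]))
# Partial variance (18).
partial_var = sigma[0, 0] - sigma[given, 0] @ beta
print(r, det_ratio, partial_var)
```

Any other coefficient vector, for example `[1.0, 1.0]`, gives a correlation no larger than `r`, as Theorem 2.5.4 asserts.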

That $\beta_{(i)}'X^{(2)}$ is the best linear predictor of $X_i$ and has the maximum
correlation between $X_i$ and linear functions of $X^{(2)}$ depends only on the
covariance structure, without regard to normality. Even if $X$ does not have a
normal distribution, the regression of $X^{(1)}$ on $X^{(2)}$ can be defined by
$\mu^{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(X^{(2)} - \mu^{(2)})$; the residuals can be defined by Definition 2.5.3;
and partial covariances and correlations can be defined as the covariances
and correlations of residuals yielding (3) and (8). Then these quantities do
not necessarily have interpretations in terms of conditional distributions. In
the case of normality $\mu_i + \beta_{(i)}'(x^{(2)} - \mu^{(2)})$ is the conditional expectation of $X_i$
given $X^{(2)} = x^{(2)}$. Without regard to normality, $X_i - \mathcal{E}X_i \mid X^{(2)}$ is uncorrelated
with any function of $X^{(2)}$, $\mathcal{E}X_i \mid X^{(2)}$ minimizes $\mathcal{E}[X_i - h(X^{(2)})]^2$ with respect
to functions $h(X^{(2)})$ of $X^{(2)}$, and $\mathcal{E}X_i \mid X^{(2)}$ maximizes the correlation between
$X_i$ and functions of $X^{(2)}$. (See Problems 2.48 to 2.51.)


2.5.3. Some Formulas for Partial Correlations 


We now consider relations between several conditional distributions obtained
by holding several different sets of variates fixed. These relations are useful
because they enable us to compute one set of conditional parameters from
another set. A very special case is

(20) $$ \rho_{12\cdot 3} = \frac{\rho_{12} - \rho_{13}\rho_{23}}{\sqrt{1 - \rho_{13}^2}\,\sqrt{1 - \rho_{23}^2}}; $$

this follows from (8) when $p = 3$ and $q = 2$. We shall now find a
generalization of this result. The derivation is tedious, but is given here for
completeness.


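Let

(21) $$ X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \\ X^{(3)} \end{pmatrix}, $$

where $X^{(1)}$ is of $p_1$ components, $X^{(2)}$ of $p_2$ components, and $X^{(3)}$ of $p_3$
components. Suppose we have the conditional distribution of $X^{(1)}$ and $X^{(2)}$
given $X^{(3)} = x^{(3)}$; how do we find the conditional distribution of $X^{(1)}$ given
$X^{(2)} = x^{(2)}$ and $X^{(3)} = x^{(3)}$? We use the fact that the conditional density of $X^{(1)}$
given $X^{(2)} = x^{(2)}$ and $X^{(3)} = x^{(3)}$ is

(22)
$$
\begin{aligned}
f(x^{(1)} \mid x^{(2)}, x^{(3)})
&= \frac{f(x^{(1)}, x^{(2)}, x^{(3)})}{f(x^{(2)}, x^{(3)})} \\
&= \frac{f(x^{(1)}, x^{(2)}, x^{(3)})/f(x^{(3)})}{f(x^{(2)}, x^{(3)})/f(x^{(3)})} \\
&= \frac{f(x^{(1)}, x^{(2)} \mid x^{(3)})}{f(x^{(2)} \mid x^{(3)})}.
\end{aligned}
$$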

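In the case of normality the conditional covariance matrix of $X^{(1)}$ and $X^{(2)}$
given $X^{(3)} = x^{(3)}$ is

(23)
$$
\mathcal{C}\left\{ \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix} \,\middle|\, x^{(3)} \right\}
= \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}
- \begin{pmatrix} \Sigma_{13} \\ \Sigma_{23} \end{pmatrix} \Sigma_{33}^{-1} \begin{pmatrix} \Sigma_{31} & \Sigma_{32} \end{pmatrix}
= \begin{pmatrix} \Sigma_{11\cdot 3} & \Sigma_{12\cdot 3} \\ \Sigma_{21\cdot 3} & \Sigma_{22\cdot 3} \end{pmatrix},
$$

say, where

(24) $$ \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \Sigma_{13} \\ \Sigma_{21} & \Sigma_{22} & \Sigma_{23} \\ \Sigma_{31} & \Sigma_{32} & \Sigma_{33} \end{pmatrix}. $$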


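The conditional covariance of $X^{(1)}$ given $X^{(2)} = x^{(2)}$ and $X^{(3)} = x^{(3)}$ is
calculated from the conditional covariances of $X^{(1)}$ and $X^{(2)}$ given $X^{(3)} = x^{(3)}$ as

(25) $$ \mathcal{C}(X^{(1)} \mid x^{(2)}, x^{(3)}) = \Sigma_{11\cdot 3} - \Sigma_{12\cdot 3}(\Sigma_{22\cdot 3})^{-1}\Sigma_{21\cdot 3}. $$

This result permits the calculation of $\sigma_{ij\cdot p_1+1,\ldots,p}$, $i, j = 1, \ldots, p_1$, from
$\sigma_{ij\cdot p_1+p_2+1,\ldots,p}$, $i, j = 1, \ldots, p_1 + p_2$.
In particular, for $p_1 = q$, $p_2 = 1$, and $p_3 = p - q - 1$, we obtain

(26) $$ \sigma_{ij\cdot q+1,\ldots,p} = \sigma_{ij\cdot q+2,\ldots,p} - \frac{\sigma_{i,q+1\cdot q+2,\ldots,p}\,\sigma_{j,q+1\cdot q+2,\ldots,p}}{\sigma_{q+1,q+1\cdot q+2,\ldots,p}}, \qquad i, j = 1, \ldots, q. $$

Since

(27) $$ \sigma_{ii\cdot q+1,\ldots,p} = \sigma_{ii\cdot q+2,\ldots,p}\big(1 - \rho^2_{i,q+1\cdot q+2,\ldots,p}\big), $$

we obtain

(28) $$ \rho_{ij\cdot q+1,\ldots,p} = \frac{\rho_{ij\cdot q+2,\ldots,p} - \rho_{i,q+1\cdot q+2,\ldots,p}\,\rho_{j,q+1\cdot q+2,\ldots,p}}{\sqrt{1 - \rho^2_{i,q+1\cdot q+2,\ldots,p}}\,\sqrt{1 - \rho^2_{j,q+1\cdot q+2,\ldots,p}}}. $$

This is a useful recursion formula to compute from $\{\rho_{ij}\}$ in succession
$\{\rho_{ij\cdot p}\}$, $\{\rho_{ij\cdot p-1,p}\}$, $\ldots$, $\rho_{12\cdot 3,\ldots,p}$.
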
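The recursion (28) and the direct definition (8) give the same partial correlation, which can be confirmed on a small correlation matrix. The code below is a sketch, not the author's algorithm as such: the function names and the 4 x 4 correlation matrix are hypothetical, indices are zero-based, and the recursion removes one conditioning variable at a time as (28) does.

```python
import numpy as np

def partial_corr_direct(sigma, i, j, given):
    """Partial correlation from Sigma_11.2 = Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21."""
    sigma = np.asarray(sigma, dtype=float)
    idx = [i, j]
    s11 = sigma[np.ix_(idx, idx)]
    s12 = sigma[np.ix_(idx, given)]
    s22 = sigma[np.ix_(given, given)]
    s = s11 - s12 @ np.linalg.solve(s22, s12.T)
    return s[0, 1] / np.sqrt(s[0, 0] * s[1, 1])

def partial_corr_recursive(rho, i, j, given):
    """Partial correlation from the simple correlations via the recursion (28)."""
    if not given:
        return rho[i][j]
    k, rest = given[0], given[1:]
    r_ij = partial_corr_recursive(rho, i, j, rest)
    r_ik = partial_corr_recursive(rho, i, k, rest)
    r_jk = partial_corr_recursive(rho, j, k, rest)
    return (r_ij - r_ik * r_jk) / np.sqrt((1.0 - r_ik ** 2) * (1.0 - r_jk ** 2))

# Hypothetical positive definite correlation matrix.
rho = np.array([[1.0, 0.5, 0.3, 0.2],
                [0.5, 1.0, 0.4, 0.1],
                [0.3, 0.4, 1.0, 0.6],
                [0.2, 0.1, 0.6, 1.0]])
direct = partial_corr_direct(rho, 0, 1, [2, 3])
recursive = partial_corr_recursive(rho, 0, 1, [2, 3])

# Special case (20): rho_{12.3}.
rho_12_3 = (rho[0, 1] - rho[0, 2] * rho[1, 2]) / np.sqrt((1 - rho[0, 2] ** 2) * (1 - rho[1, 2] ** 2))
print(direct, recursive, rho_12_3)
```

The naive recursion recomputes lower-order partial correlations repeatedly; for large $p$ the matrix formula is usually preferable, but the recursion mirrors the hand computation described above.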
2.6. THE CHARACTERISTIC FUNCTION; MOMENTS 


2.6.1. The Characteristic Function 


The characteristic function of a multivariate normal distribution has a form 
similar to the density function. From the characteristic function, moments 
and cumulants can be found easily. 


Definition 2.6.1. The characteristic function of a random vector $X$ is

(1) $$ \phi(t) = \mathcal{E}e^{it'X} $$

defined for every real vector $t$.


To make this definition meaningful we need to define the expected value 
of a complex-valued function of a random vector. 


Definition 2.6.2. Let the complex-valued function $g(x)$ be written as
$g(x) = g_1(x) + ig_2(x)$, where $g_1(x)$ and $g_2(x)$ are real-valued. Then the expected value
of $g(X)$ is

(2) $$ \mathcal{E}g(X) = \mathcal{E}g_1(X) + i\,\mathcal{E}g_2(X). $$

In particular, since $e^{i\theta} = \cos\theta + i\sin\theta$,

(3) $$ \mathcal{E}e^{it'X} = \mathcal{E}\cos t'X + i\,\mathcal{E}\sin t'X. $$


To evaluate the characteristic function of a vector $X$, it is often convenient
to use the following lemma:


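Lemma 2.6.1. Let $X' = (X^{(1)\prime}\; X^{(2)\prime})$. If $X^{(1)}$ and $X^{(2)}$ are independent and
$g(x) = g^{(1)}(x^{(1)})\,g^{(2)}(x^{(2)})$, then

(4) $$ \mathcal{E}g(X) = \mathcal{E}g^{(1)}(X^{(1)})\;\mathcal{E}g^{(2)}(X^{(2)}). $$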




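Proof. If $g(x)$ is real-valued and $X$ has a density,

(5)
$$
\begin{aligned}
\mathcal{E}g(X) &= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x) f(x)\, dx_1 \cdots dx_p \\
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g^{(1)}(x^{(1)})\, g^{(2)}(x^{(2)})\, f^{(1)}(x^{(1)})\, f^{(2)}(x^{(2)})\, dx_1 \cdots dx_p \\
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g^{(1)}(x^{(1)}) f^{(1)}(x^{(1)})\, dx_1 \cdots dx_q
   \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g^{(2)}(x^{(2)}) f^{(2)}(x^{(2)})\, dx_{q+1} \cdots dx_p \\
&= \mathcal{E}g^{(1)}(X^{(1)})\;\mathcal{E}g^{(2)}(X^{(2)}).
\end{aligned}
$$

If $g(x)$ is complex-valued,

(6)
$$
\begin{aligned}
g(x) &= \big[g_1^{(1)}(x^{(1)}) + i g_2^{(1)}(x^{(1)})\big]\big[g_1^{(2)}(x^{(2)}) + i g_2^{(2)}(x^{(2)})\big] \\
&= g_1^{(1)}(x^{(1)})\, g_1^{(2)}(x^{(2)}) - g_2^{(1)}(x^{(1)})\, g_2^{(2)}(x^{(2)})
 + i\big[g_2^{(1)}(x^{(1)})\, g_1^{(2)}(x^{(2)}) + g_1^{(1)}(x^{(1)})\, g_2^{(2)}(x^{(2)})\big].
\end{aligned}
$$

Then

(7)
$$
\begin{aligned}
\mathcal{E}g(X) &= \mathcal{E}\big[g_1^{(1)}(X^{(1)})\, g_1^{(2)}(X^{(2)}) - g_2^{(1)}(X^{(1)})\, g_2^{(2)}(X^{(2)})\big] \\
&\quad + i\,\mathcal{E}\big[g_2^{(1)}(X^{(1)})\, g_1^{(2)}(X^{(2)}) + g_1^{(1)}(X^{(1)})\, g_2^{(2)}(X^{(2)})\big] \\
&= \mathcal{E}g_1^{(1)}(X^{(1)})\,\mathcal{E}g_1^{(2)}(X^{(2)}) - \mathcal{E}g_2^{(1)}(X^{(1)})\,\mathcal{E}g_2^{(2)}(X^{(2)}) \\
&\quad + i\big[\mathcal{E}g_2^{(1)}(X^{(1)})\,\mathcal{E}g_1^{(2)}(X^{(2)}) + \mathcal{E}g_1^{(1)}(X^{(1)})\,\mathcal{E}g_2^{(2)}(X^{(2)})\big] \\
&= \big[\mathcal{E}g_1^{(1)}(X^{(1)}) + i\,\mathcal{E}g_2^{(1)}(X^{(1)})\big]\big[\mathcal{E}g_1^{(2)}(X^{(2)}) + i\,\mathcal{E}g_2^{(2)}(X^{(2)})\big] \\
&= \mathcal{E}g^{(1)}(X^{(1)})\;\mathcal{E}g^{(2)}(X^{(2)}).
\end{aligned}
$$

∎

By applying Lemma 2.6.1 successively to $g(X) = e^{it'X}$, we derive


Lemma 2.6.2. If the components of $X$ are mutually independent,

(8) $$ \mathcal{E}e^{it'X} = \prod_{j=1}^{p} \mathcal{E}e^{it_jX_j}. $$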


We now find the characteristic function of a random vector with a normal 
distribution. 




Theorem 2.6.1. The characteristic function of $X$ distributed according to
$N(\mu, \Sigma)$ is

(9) $$ \phi(t) = \mathcal{E}e^{it'X} = e^{it'\mu - \frac{1}{2}t'\Sigma t} $$

for every real vector $t$.


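Proof. From Corollary A.1.6 of the Appendix we know there is a nonsingular
matrix $C$ such that

(10) $$ C'\Sigma^{-1}C = I. $$

Thus

(11) $$ \Sigma^{-1} = C'^{-1}C^{-1} = (CC')^{-1}. $$

Let

(12) $$ X - \mu = CY. $$

Then $Y$ is distributed according to $N(0, I)$.
Now the characteristic function of $Y$ is

(13) $$ \psi(u) = \mathcal{E}e^{iu'Y} = \prod_{j=1}^{p} \mathcal{E}e^{iu_jY_j}. $$

Since $Y_j$ is distributed according to $N(0, 1)$,

(14) $$ \psi(u) = \prod_{j=1}^{p} e^{-\frac{1}{2}u_j^2} = e^{-\frac{1}{2}u'u}. $$

Thus

(15)
$$
\begin{aligned}
\phi(t) &= \mathcal{E}e^{it'X} = \mathcal{E}e^{it'(CY + \mu)} \\
&= e^{it'\mu}\,\mathcal{E}e^{it'CY} \\
&= e^{it'\mu}\, e^{-\frac{1}{2}(t'C)(t'C)'}
\end{aligned}
$$

for $t'C = u'$; the third equality is verified by writing both sides of it as
integrals. But this is

(16) $$ \phi(t) = e^{it'\mu}\, e^{-\frac{1}{2}t'CC't} = e^{it'\mu - \frac{1}{2}t'\Sigma t} $$

by (11). This proves the theorem. ∎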
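A quick simulation makes formula (9) concrete. The sketch below compares the closed form with the sample average of $e^{it'X}$; it is an illustration only, with hypothetical parameter values, a fixed random seed, and a Monte Carlo sample size chosen here so that the sampling error is roughly of order $10^{-3}$.

```python
import numpy as np

def normal_cf(t, mu, sigma):
    """Characteristic function exp(i t'mu - t' sigma t / 2) of N(mu, sigma) at t."""
    t = np.asarray(t, dtype=float)
    return np.exp(1j * (t @ mu) - 0.5 * (t @ sigma @ t))

def empirical_cf(t, samples):
    """Sample average of exp(i t'x) over the rows x of `samples`."""
    return np.mean(np.exp(1j * (samples @ t)))

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])
samples = rng.multivariate_normal(mu, sigma, size=200_000)

t = np.array([0.7, -0.4])
exact = normal_cf(t, mu, sigma)
approx = empirical_cf(t, samples)
print(exact, approx)
```

The modulus of the exact value is $e^{-\frac{1}{2}t'\Sigma t}$, which depends on $t$ only through the quadratic form, while the mean enters only through the phase $t'\mu$.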




The characteristic function of the normal distribution is very useful. For
example, we can use this method of proof to demonstrate the results of
Section 2.4. If $Z = DX$, then the characteristic function of $Z$ is

(17)
$$
\begin{aligned}
\mathcal{E}e^{it'Z} &= \mathcal{E}e^{it'DX} = \mathcal{E}e^{i(D't)'X} \\
&= e^{i(D't)'\mu - \frac{1}{2}(D't)'\Sigma(D't)} \\
&= e^{it'(D\mu) - \frac{1}{2}t'(D\Sigma D')t},
\end{aligned}
$$

which is the characteristic function of $N(D\mu, D\Sigma D')$ (by Theorem 2.6.1).

It is interesting to use the characteristic function to show that it is only the
multivariate normal distribution that has the property that every linear
combination of variates is normally distributed. Consider a vector $Y$ of $p$
components with density $f(y)$ and characteristic function

(18) $$ \psi(u) = \mathcal{E}e^{iu'Y} = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{iu'y} f(y)\, dy_1 \cdots dy_p, $$

and suppose the mean of $Y$ is $\mu$ and the covariance matrix is $\Sigma$. Suppose $u'Y$
is normally distributed for every $u$. Then the characteristic function of such
linear combination is

(19) $$ \mathcal{E}e^{itu'Y} = e^{itu'\mu - \frac{1}{2}t^2u'\Sigma u}. $$

Now set $t = 1$. Since the right-hand side is then the characteristic function of
$N(\mu, \Sigma)$, the result is proved (by Theorem 2.6.1 above and 2.6.3 below).


Theorem 2.6.2. If every linear combination of the components of a vector $Y$
is normally distributed, then $Y$ is normally distributed.


It might be pointed out in passing that it is essential that every linear
combination be normally distributed for Theorem 2.6.2 to hold. For instance,
if $Y = (Y_1, Y_2)'$ and $Y_1$ and $Y_2$ are not independent, then $Y_1$ and $Y_2$ can each
have a marginal normal distribution. An example is most easily given
geometrically. Let $X_1, X_2$ have a joint normal distribution with means 0. Move the
same mass in Figure 2.1 from rectangle A to C and from B to D. It will be
seen that the resulting distribution of $Y$ is such that the marginal
distributions of $Y_1$ and $Y_2$ are the same as $X_1$ and $X_2$, respectively, which are
normal, and yet the joint distribution of $Y_1$ and $Y_2$ is not normal.

This example can be used also to demonstrate that two variables, $Y_1$ and
$Y_2$, can be uncorrelated and the marginal distribution of each may be normal,
but the pair need not have a joint normal distribution and need not be
independent. This is done by choosing the rectangles so that for the resultant
distribution the expected value of $Y_1Y_2$ is zero. It is clear geometrically that
this can be done.


Figure 2.1

For future reference we state two useful theorems concerning characteris- 
tic functions. 


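Theorem 2.6.3. If the random vector $X$ has the density $f(x)$ and the
characteristic function $\phi(t)$, then

(20) $$ f(x) = \frac{1}{(2\pi)^p} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{-it'x}\phi(t)\, dt_1 \cdots dt_p. $$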


This shows that the characteristic function determines the density function 
uniquely. If X does not have a density, the characteristic function uniquely 
defines the probability of any continuity interval. In the univariate case a
continuity interval is an interval such that the cdf does not have a
discontinuity at an endpoint of the interval.


Theorem 2.6.4. Let $\{F_j(x)\}$ be a sequence of cdfs, and let $\{\phi_j(t)\}$ be the
sequence of corresponding characteristic functions. A necessary and sufficient
condition for $F_j(x)$ to converge to a cdf $F(x)$ is that, for every $t$, $\phi_j(t)$ converges
to a limit $\phi(t)$ that is continuous at $t = 0$. When this condition is satisfied, the
limit $\phi(t)$ is identical with the characteristic function of the limiting distribution
$F(x)$.


For the proofs of these two theorems, the reader is referred to Cramér
(1946), Sections 10.6 and 10.7.




2.6.2. The Moments and Cumulants 


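The moments of $X_1, \ldots, X_p$ with a joint normal distribution can be obtained
from the characteristic function (9). The mean is

(21)
$$
\begin{aligned}
\mathcal{E}X_h &= \frac{1}{i}\,\frac{\partial\phi}{\partial t_h}\bigg|_{t=0} \\
&= \frac{1}{i}\Big\{ -\sum_j \sigma_{hj}t_j + i\mu_h \Big\}\phi(t)\bigg|_{t=0} \\
&= \mu_h.
\end{aligned}
$$

The second moment is

(22)
$$
\begin{aligned}
\mathcal{E}X_hX_j &= \frac{1}{i^2}\,\frac{\partial^2\phi}{\partial t_h\,\partial t_j}\bigg|_{t=0} \\
&= \frac{1}{i^2}\Big\{ \Big(-\sum_k \sigma_{hk}t_k + i\mu_h\Big)\Big(-\sum_k \sigma_{kj}t_k + i\mu_j\Big) - \sigma_{hj} \Big\}\phi(t)\bigg|_{t=0} \\
&= \sigma_{hj} + \mu_h\mu_j.
\end{aligned}
$$

Thus

(23) $$ \operatorname{Variance}(X_i) = \mathcal{E}(X_i - \mu_i)^2 = \sigma_{ii}, $$
(24) $$ \operatorname{Covariance}(X_i, X_j) = \mathcal{E}(X_i - \mu_i)(X_j - \mu_j) = \sigma_{ij}. $$

Any third moment about the mean is

(25) $$ \mathcal{E}(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k) = 0. $$

The fourth moment about the mean is

(26) $$ \mathcal{E}(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k)(X_l - \mu_l) = \sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}. $$

Every moment of odd order is 0.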
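The fourth-moment formula (26) is easy to evaluate and to check by simulation. In the sketch below, `fourth_central_moment` is a hypothetical helper, indices are zero-based, and the Monte Carlo comparison uses an arbitrary seed and sample size chosen for this illustration.

```python
import numpy as np

def fourth_central_moment(sigma, i, j, k, l):
    """E (X_i - mu_i)(X_j - mu_j)(X_k - mu_k)(X_l - mu_l) for a normal vector, by (26)."""
    return (sigma[i, j] * sigma[k, l]
            + sigma[i, k] * sigma[j, l]
            + sigma[i, l] * sigma[j, k])

sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
mean = np.array([2.0, -1.0])

# E (X_1 - mu_1)^2 (X_2 - mu_2)^2 = sigma_11 sigma_22 + 2 sigma_12^2.
exact = fourth_central_moment(sigma, 0, 0, 1, 1)

rng = np.random.default_rng(1)
x = rng.multivariate_normal(mean, sigma, size=400_000)
c = x - mean                       # center at the true mean
estimate = np.mean(c[:, 0] ** 2 * c[:, 1] ** 2)
print(exact, estimate)
```

With all four indices equal the formula reduces to the familiar univariate result $3\sigma_{ii}^2$.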


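Definition 2.6.3. If all the moments of a distribution exist, then the
cumulants are the coefficients $\kappa$ in

(27) $$ \log\phi(t) = \sum_{s_1,\ldots,s_p=0}^{\infty} \kappa_{s_1\cdots s_p}\, \frac{(it_1)^{s_1}\cdots(it_p)^{s_p}}{s_1!\cdots s_p!}. $$

In the case of the multivariate normal distribution $\kappa_{10\cdots 0} = \mu_1, \ldots, \kappa_{0\cdots 01} = \mu_p$,
$\kappa_{20\cdots 0} = \sigma_{11}, \ldots, \kappa_{0\cdots 02} = \sigma_{pp}$, $\kappa_{110\cdots 0} = \sigma_{12}, \ldots$. The cumulants for
which $\sum_i s_i > 2$ are 0.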


2.7. ELLIPTICALLY CONTOURED DISTRIBUTIONS 


2.7.1. Spherically and Elliptically Contoured Distributions 


It was noted at the end of Section 2.3 that the density of the multivariate
normal distribution with mean $\mu$ and covariance matrix $\Sigma$ is constant on
concentric ellipsoids

(1) $$ (x - \mu)'\Sigma^{-1}(x - \mu) = k. $$

A general class of distributions with this property is the class of elliptically
contoured distributions with density

(2) $$ |\Lambda|^{-\frac{1}{2}}\, g\big[(x - \nu)'\Lambda^{-1}(x - \nu)\big], $$

where $\Lambda$ is a positive definite matrix, $g(\cdot) \ge 0$, and

(3) $$ \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(y'y)\, dy_1 \cdots dy_p = 1. $$

If $C$ is a nonsingular matrix such that $C'\Lambda^{-1}C = I$, the transformation
$x - \nu = Cy$ carries the density (2) to the density $g(y'y)$. The contours of
constant density of $g(y'y)$ are spheres centered at the origin. The class of
such densities is known as the spherically contoured distributions. Elliptically
contoured distributions do not necessarily have densities, but in this
exposition only distributions with densities will be treated for statistical inference.


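A spherically contoured density can be expressed in polar coordinates by
the transformation

(4)
$$
\begin{aligned}
y_1 &= r\sin\theta_1, \\
y_2 &= r\cos\theta_1\sin\theta_2, \\
y_3 &= r\cos\theta_1\cos\theta_2\sin\theta_3, \\
&\;\;\vdots \\
y_{p-1} &= r\cos\theta_1\cos\theta_2\cdots\cos\theta_{p-2}\sin\theta_{p-1}, \\
y_p &= r\cos\theta_1\cos\theta_2\cdots\cos\theta_{p-2}\cos\theta_{p-1},
\end{aligned}
$$

where $-\frac{1}{2}\pi < \theta_i \le \frac{1}{2}\pi$, $i = 1, \ldots, p-2$, $-\pi < \theta_{p-1} \le \pi$, and $0 \le r < \infty$.
Note that $y'y = r^2$. The Jacobian of the transformation (4) is
$r^{p-1}\cos^{p-2}\theta_1\cos^{p-3}\theta_2\cdots\cos\theta_{p-2}$. See Problem 7.1. If $g(y'y)$ is the density
of $Y$, then the density of $R, \Theta_1, \ldots, \Theta_{p-1}$ is

(5) $$ r^{p-1}\cos^{p-2}\theta_1\cos^{p-3}\theta_2\cdots\cos\theta_{p-2}\; g(r^2). $$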
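Note that $R, \Theta_1, \ldots, \Theta_{p-1}$ are independently distributed. Since

(6) $$ \int_{-\pi/2}^{\pi/2} \cos^{h-1}\theta\, d\theta = \frac{\Gamma(\frac{1}{2}h)\,\Gamma(\frac{1}{2})}{\Gamma[\frac{1}{2}(h+1)]} $$

(Problem 7.2), the marginal density of $R$ is

(7) $$ C(p)\, g(r^2)\, r^{p-1}, $$

where

(8)
$$
\begin{aligned}
C(p) &= \frac{2\pi^{\frac{1}{2}p}}{\Gamma(\frac{1}{2}p)} \\
&= \int_{-\pi}^{\pi}\int_{-\pi/2}^{\pi/2} \cdots \int_{-\pi/2}^{\pi/2} \cos^{p-2}\theta_1\cos^{p-3}\theta_2\cdots\cos\theta_{p-2}\; d\theta_1 \cdots d\theta_{p-2}\, d\theta_{p-1}.
\end{aligned}
$$

The marginal density of $\Theta_i$ is $\Gamma[\frac{1}{2}(p-i+1)]\cos^{p-i-1}\theta_i / \{\Gamma(\frac{1}{2})\,\Gamma[\frac{1}{2}(p-i)]\}$,
$i = 1, \ldots, p-2$, and of $\Theta_{p-1}$ is $1/(2\pi)$.

In the normal case of $N(0, I)$ the density of $Y$ is

$$ g(y'y) = (2\pi)^{-\frac{1}{2}p}\exp(-\tfrac{1}{2}y'y), $$

and the density of $R = (Y'Y)^{\frac{1}{2}}$ is $r^{p-1}\exp(-\frac{1}{2}r^2) / [2^{\frac{1}{2}p-1}\,\Gamma(\frac{1}{2}p)]$. The density
of $r^2 = v$ is $v^{\frac{1}{2}p-1}e^{-\frac{1}{2}v} / [2^{\frac{1}{2}p}\,\Gamma(\frac{1}{2}p)]$. This is the $\chi^2$-density with $p$ degrees of
freedom.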
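The constant $C(p)$ and the radial density (7) can be checked numerically in the normal case. The following sketch is an illustration with assumed helper names (`sphere_area`, `radial_density_normal`, `trapezoid`) and an arbitrarily chosen grid; it integrates the density of $R$ for $p = 5$ and also computes $\mathcal{E}R^2$, which should equal $p$ because $R^2$ has the $\chi^2$-distribution with $p$ degrees of freedom.

```python
import math
import numpy as np

def sphere_area(p):
    """C(p) = 2 pi^{p/2} / Gamma(p/2), the surface area of the unit sphere in p dimensions."""
    return 2.0 * math.pi ** (p / 2.0) / math.gamma(p / 2.0)

def radial_density_normal(r, p):
    """Density C(p) g(r^2) r^{p-1} of R = (Y'Y)^{1/2} when Y ~ N(0, I_p)."""
    g = (2.0 * math.pi) ** (-p / 2.0) * np.exp(-0.5 * r ** 2)
    return sphere_area(p) * g * r ** (p - 1)

def trapezoid(y, x):
    """Composite trapezoidal rule on the grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

p = 5
r = np.linspace(0.0, 12.0, 20001)      # the mass beyond r = 12 is negligible
dens = radial_density_normal(r, p)
total_mass = trapezoid(dens, r)
mean_r_squared = trapezoid(r ** 2 * dens, r)
print(sphere_area(2), sphere_area(3), total_mass, mean_r_squared)
```

For $p = 2$ and $p = 3$ the function `sphere_area` returns $2\pi$ and $4\pi$, the circumference of the unit circle and the area of the ordinary unit sphere.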

The constant $C(p)$ is the surface area of a sphere of unit radius in $p$
dimensions. The random vector $U$ with coordinates $\sin\Theta_1, \cos\Theta_1\sin\Theta_2, \ldots,$
$\cos\Theta_1\cos\Theta_2\cdots\cos\Theta_{p-1}$, where $\Theta_1, \ldots, \Theta_{p-1}$ are independently distributed
with the marginal densities given above [$\Theta_i$ having density proportional to
$\cos^{p-i-1}\theta_i$ over $(-\pi/2, \pi/2)$, $i = 1, \ldots, p-2$, and $\Theta_{p-1}$ having
the uniform distribution over $(-\pi, \pi)$], is said to be uniformly distributed on
the unit sphere. (This is the simplest example of a spherically contoured
distribution not having a density.) A stochastic representation of $Y$ with the
density $g(y'y)$ is

(9) $$ Y \overset{d}{=} RU, $$

where $R$ has the density (7).
Since each of the densities of $\Theta_1, \ldots, \Theta_{p-1}$ is even,

(10) $$ \mathcal{E}U = 0. $$

Because $R$ and $U$ are independent,

(11) $$ \mathcal{E}Y = 0 $$

if $\mathcal{E}R < \infty$. Further,

(12) $$ \mathcal{E}YY' = \mathcal{E}R^2\;\mathcal{E}UU' $$

if $\mathcal{E}R^2 < \infty$. By symmetry $\mathcal{E}U_1^2 = \cdots = \mathcal{E}U_p^2 = 1/p$ because $\sum_{i=1}^{p} U_i^2 = 1$.
Again by symmetry $\mathcal{E}U_1U_2 = \mathcal{E}U_1U_3 = \cdots = \mathcal{E}U_{p-1}U_p$. In particular $\mathcal{E}U_1U_2
= \mathcal{E}\sin\Theta_1\cos\Theta_1\sin\Theta_2$, the integrand of which is an odd function of $\theta_1$ and
of $\theta_2$. Hence, $\mathcal{E}U_iU_j = 0$, $i \ne j$. To summarize,

(13) $$ \mathcal{E}UU' = (1/p)\,I_p, $$

and

(14) $$ \mathcal{E}YY' = (1/p)\,\mathcal{E}R^2\; I_p $$

(if $\mathcal{E}R^2 < \infty$).
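The moments (10) and (13) of a vector uniformly distributed on the unit sphere can be checked by simulation. The sketch below does not use the polar angles; instead it generates $U$ by normalizing a $N(0, I)$ vector, which is a standard construction assumed here for convenience (it is justified by the representation $Y = RU$ in the normal case). The function name, seed, and sample size are illustrative.

```python
import numpy as np

def sample_unit_sphere(n, p, rng):
    """Draw n points uniformly on the unit sphere in p dimensions as Y / |Y|, Y ~ N(0, I_p)."""
    y = rng.standard_normal((n, p))
    return y / np.linalg.norm(y, axis=1, keepdims=True)

rng = np.random.default_rng(2)
p = 3
u = sample_unit_sphere(200_000, p, rng)

mean_u = u.mean(axis=0)                  # estimates E U = 0
second_moment = u.T @ u / u.shape[0]     # estimates E UU' = (1/p) I
print(mean_u, second_moment)
```

For $p = 3$ the diagonal entries are close to $1/3$. Note that $U_1 = \sin\Theta_1$ then has variance $1/3$, whereas an angle uniform over $(-\pi/2, \pi/2)$ would give $\mathcal{E}\sin^2\Theta_1 = 1/2$.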


The distinguishing characteristic of the class of spherically contoured
distributions is that $OY \overset{d}{=} Y$ for every orthogonal matrix $O$.


Theorem 2.7.1. If $Y$ has the density $g(y'y)$, then $Z = OY$, where $O'O = I$,
has the density $g(z'z)$.


Proof. The transformation $z = Oy$ has Jacobian 1. ∎


We shall extend the definition of $Y$ being spherically contoured to any
distribution with the property $OY \overset{d}{=} Y$.


Corollary 2.7.1. If $Y$ is spherically contoured with stochastic representation
$Y \overset{d}{=} RU$ with $R^2 = Y'Y$, then $U$ is spherically contoured.


Proof. If $Z = OY$ and hence $Z \overset{d}{=} Y$, and $Z$ has the stochastic
representation $Z = SV$, where $S^2 = Z'Z$, then $S = R$ and $V = OU \overset{d}{=} U$. ∎




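The density of $X = \nu + CY$ is (2). From (11) and (14) we derive the
following theorem:


Theorem 2.7.2. If $X$ has the density (2) and $\mathcal{E}R^2 < \infty$,

(15) $$ \mathcal{E}X = \mu = \nu, \qquad \mathcal{C}(X) = \mathcal{E}(X - \mu)(X - \mu)' = \Sigma = (1/p)\,\mathcal{E}R^2\,\Lambda. $$


In fact if $\mathcal{E}R^m < \infty$, a moment of $X$ of order $h$ ($\le m$) is
$\mathcal{E}(X_1 - \mu_1)^{h_1}\cdots(X_p - \mu_p)^{h_p} = \mathcal{E}Z_1^{h_1}\cdots Z_p^{h_p}\;\mathcal{E}R^h / \mathcal{E}(\chi_p^2)^{\frac{1}{2}h}$, where $Z$ has the distribution
$N(0, \Lambda)$ and $h = h_1 + \cdots + h_p$.


Theorem 2.7.3. If $X$ has the density (2), $\mathcal{E}R^2 < \infty$, and $f[c\,\mathcal{C}(X)] =
f[\mathcal{C}(X)]$ for all $c > 0$, then $f[\mathcal{C}(X)] = f(\Lambda)$.

In particular $\rho_{ij}(X) = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}} = \lambda_{ij}/\sqrt{\lambda_{ii}\lambda_{jj}}$, where $\Sigma = (\sigma_{ij})$ and
$\Lambda = (\lambda_{ij})$.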


2.7.2. Distributions of Linear Combinations; Marginal Distributions 


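First we consider a spherically contoured distribution with density $g(y'y)$.
Let $y' = (y_1', y_2')$, where $y_1$ and $y_2$ have $q$ and $p-q$ components, respectively.
The marginal density of $y_2$ is

(16)  $\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g(y_1'y_1 + y_2'y_2)\,dy_1\cdots dy_q.$

Express $y_1$ in polar coordinates (4) with $r$ replaced by $r_1$ and $p$ replaced by
$q$. Then the marginal density of $y_2$ is

(17)  $g_2(y_2'y_2) = C(q)\int_0^\infty g(r_1^2 + y_2'y_2)\,r_1^{q-1}\,dr_1.$

This expression shows that the marginal distribution of $y_2$ has a density
which is spherically contoured.

Now consider a vector $X' = (X^{(1)\prime}, X^{(2)\prime})$ with density (2). If $\mathcal{E}R^2 < \infty$,
the covariance matrix of $X$ is (15) partitioned as (14) of Section 2.4.
Let $Z^{(1)} = X^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}X^{(2)} = X^{(1)} - \Lambda_{12}\Lambda_{22}^{-1}X^{(2)}$, $Z^{(2)} = X^{(2)}$, $\tau^{(1)} =
\nu^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}\nu^{(2)} = \nu^{(1)} - \Lambda_{12}\Lambda_{22}^{-1}\nu^{(2)}$, $\tau^{(2)} = \nu^{(2)}$. Then the density of $Z' =
(Z^{(1)\prime}, Z^{(2)\prime})$ is

(18)  $|\Lambda_{11\cdot2}|^{-\frac12}|\Lambda_{22}|^{-\frac12}\,g\big[(z^{(1)}-\tau^{(1)})'\Lambda_{11\cdot2}^{-1}(z^{(1)}-\tau^{(1)}) + (z^{(2)}-\nu^{(2)})'\Lambda_{22}^{-1}(z^{(2)}-\nu^{(2)})\big].$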



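Note that $Z^{(1)}$ and $Z^{(2)}$ are uncorrelated even though possibly dependent.
Let $C_1$ and $C_2$ be $q \times q$ and $(p-q)\times(p-q)$ matrices satisfying $C_1'\Lambda_{11\cdot2}^{-1}C_1
= I_q$ and $C_2'\Lambda_{22}^{-1}C_2 = I_{p-q}$. Define $y^{(1)}$ and $y^{(2)}$ by $z^{(1)} - \tau^{(1)} = C_1y^{(1)}$ and
$z^{(2)} - \nu^{(2)} = C_2y^{(2)}$. Then $Y^{(1)}$ and $Y^{(2)}$ have the density $g(y^{(1)\prime}y^{(1)} + y^{(2)\prime}y^{(2)})$.
The marginal density of $Y^{(2)}$ is (17), and the marginal density of $X^{(2)} = Z^{(2)}$ is

(19)  $|\Lambda_{22}|^{-\frac12}\,g_2\big[(x^{(2)}-\nu^{(2)})'\Lambda_{22}^{-1}(x^{(2)}-\nu^{(2)})\big]$
      $= |\Lambda_{22}|^{-\frac12}\,C(q)\int_0^\infty g\big[r_1^2 + (x^{(2)}-\nu^{(2)})'\Lambda_{22}^{-1}(x^{(2)}-\nu^{(2)})\big]\,r_1^{q-1}\,dr_1.$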


The moments of Y, can be calculated from the moments of Y. 

The generalization of Theorem 2.4.1 to elliptically contoured distributions
is the following: Let $X$ with $p$ components have the density (2). Then $Y = CX$
has the density $|C\Lambda C'|^{-\frac12}\,g[(y - C\nu)'(C\Lambda C')^{-1}(y - C\nu)]$ for $C$ nonsingular.

The generalization of Theorem 2.4.4 is the following: If $X$ has the density
(2), then $Z = DX$ has the density

(20)  $|D\Lambda D'|^{-\frac12}\,g_2[(z - D\nu)'(D\Lambda D')^{-1}(z - D\nu)],$

where $D$ is a $q \times p$ matrix of rank $q \le p$ and $g_2$ is given by (17).


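We can also characterize marginal distributions in terms of the representation
(9). Consider

(21)  $Y = \begin{pmatrix} Y^{(1)} \\ Y^{(2)} \end{pmatrix} \overset{d}{=} RU = R\begin{pmatrix} U^{(1)} \\ U^{(2)} \end{pmatrix},$

where $Y^{(1)}$ and $U^{(1)}$ have $q$ components and $Y^{(2)}$ and $U^{(2)}$ have $p-q$
components. Then $R_2^2 = Y^{(2)\prime}Y^{(2)}$ has the distribution of $R^2\,U^{(2)\prime}U^{(2)}$, and

(22)  $U^{(2)\prime}U^{(2)} = \dfrac{U^{(2)\prime}U^{(2)}}{U'U} \overset{d}{=} \dfrac{Y^{(2)\prime}Y^{(2)}}{Y^{(1)\prime}Y^{(1)} + Y^{(2)\prime}Y^{(2)}}.$

In the case $Y \sim N(0, I_p)$, (22) has the beta distribution, say $B(p-q, q)$, with
density

(23)  $\dfrac{\Gamma(p/2)}{\Gamma(q/2)\,\Gamma[(p-q)/2]}\; z^{\frac12(p-q)-1}(1-z)^{\frac12 q-1}, \qquad 0 \le z \le 1.$

Hence, in general,

(24)  $Y^{(2)} \overset{d}{=} R_2V,$

where $R_2^2 \overset{d}{=} R^2b$, $b \sim B(p-q, q)$, $V$ has the uniform distribution of $v'v = 1$
in $p_2$ ($= p - q$) dimensions, and $R^2$, $b$, and $V$ are independent. All marginal
distributions are elliptically contoured.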
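The beta law in (22) and (23) can be illustrated by simulation. The sketch below is only a check under stated assumptions: $U$ is generated by normalizing standard normal vectors, and the book's $B(p-q, q)$ is read as the ordinary beta distribution with shape parameters $(p-q)/2$ and $q/2$, which is what the density (23) gives. Variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, n = 5, 2, 200_000

# U uniform on the unit sphere in p dimensions (normalized normal vectors, an assumption).
z = rng.standard_normal((n, p))
U = z / np.linalg.norm(z, axis=1, keepdims=True)

# b = U^(2)' U^(2): squared length of the last p - q coordinates.
b = np.sum(U[:, q:] ** 2, axis=1)

# Moments of the beta distribution with shapes a = (p - q)/2 and c = q/2.
a, c = (p - q) / 2, q / 2
beta_mean = a / (a + c)                              # equals (p - q)/p
beta_var = a * c / ((a + c) ** 2 * (a + c + 1))

print(b.mean(), beta_mean)
print(b.var(), beta_var)
```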


2.7.3. Conditional Distributions and Multiple Correlation Coefficient 


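The density of the conditional distribution of $y_1$ given $y_2$ when $y = (y_1', y_2')'$
has the spherical density $g(y'y)$ is

(25)  $\dfrac{g(y_1'y_1 + y_2'y_2)}{g_2(y_2'y_2)} = \dfrac{g(y_1'y_1 + r_2^2)}{g_2(r_2^2)},$

where the marginal density $g_2(y_2'y_2)$ is given by (17) and $r_2^2 = y_2'y_2$. In terms
of $y_1$, (25) is a spherically contoured distribution (depending on $r_2^2$).

Now consider $X = (X^{(1)\prime}, X^{(2)\prime})'$ with density (2). The conditional density of
$X^{(1)}$ given $X^{(2)} = x^{(2)}$ is

(26)  $|\Lambda_{11\cdot2}|^{-\frac12}\,g\big\{[x^{(1)}-\nu^{(1)}-B(x^{(2)}-\nu^{(2)})]'\Lambda_{11\cdot2}^{-1}[x^{(1)}-\nu^{(1)}-B(x^{(2)}-\nu^{(2)})]$
      $\qquad + (x^{(2)}-\nu^{(2)})'\Lambda_{22}^{-1}(x^{(2)}-\nu^{(2)})\big\} \div g_2\big[(x^{(2)}-\nu^{(2)})'\Lambda_{22}^{-1}(x^{(2)}-\nu^{(2)})\big]$
      $= |\Lambda_{11\cdot2}|^{-\frac12}\,g\big\{[x^{(1)}-\nu^{(1)}-B(x^{(2)}-\nu^{(2)})]'\Lambda_{11\cdot2}^{-1}[x^{(1)}-\nu^{(1)}-B(x^{(2)}-\nu^{(2)})] + r_2^2\big\} \div g_2(r_2^2),$

where $r_2^2 = (x^{(2)}-\nu^{(2)})'\Lambda_{22}^{-1}(x^{(2)}-\nu^{(2)})$ and $B = \Lambda_{12}\Lambda_{22}^{-1}$. The density (26) is
elliptically contoured in $x^{(1)} - \nu^{(1)} - B(x^{(2)} - \nu^{(2)})$ as a function of $x^{(1)}$. The
conditional mean of $X^{(1)}$ given $X^{(2)} = x^{(2)}$ is

(27)  $\mathcal{E}(X^{(1)} \mid x^{(2)}) = \nu^{(1)} + B(x^{(2)} - \nu^{(2)})$

if $\mathcal{E}(R_1^2 \mid Y_2'Y_2 = r_2^2) < \infty$ in (25), where $R_1^2 = Y_1'Y_1$. Also the conditional covariance
matrix is $(\mathcal{E}r_1^2/q)\Lambda_{11\cdot2}$. It follows that Definition 2.5.2 of the partial
correlation coefficient holds when $(\sigma_{ij\cdot q+1,\dots,p}) = \Sigma_{11\cdot2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$
and $\Sigma$ is the parameter matrix given above.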

Theorems 2.5.2, 2.5.3, and 2.5.4 are true for any elliptically contoured
distribution for which $\mathcal{E}R^2 < \infty$.


2.7.4. The Characteristic Function; Moments 


The characteristic function $\mathcal{E}e^{it'Y}$ of a random vector $Y$ with a spherically
contoured distribution has the property of invariance over orthogonal
transformations, that is,

(28)  $\mathcal{E}e^{it'OY} = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{it'Oy}\,g(y'y)\,dy_1\cdots dy_p$
      $= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{it'z}\,g(z'z)\,dz_1\cdots dz_p$
      $= \mathcal{E}e^{it'Z} = \mathcal{E}e^{it'Y},$

where $Z = OY$ also has the density $g(y'y)$. The equality (28) for all orthogonal
$O$ implies $\mathcal{E}e^{it'Y}$ is a function of $t't$. We write

(29)  $\mathcal{E}e^{it'Y} = \phi(t't).$

Then for $X = \mu + CY$

(30)  $\mathcal{E}e^{it'X} = e^{it'\mu}\,\mathcal{E}e^{it'CY} = e^{it'\mu}\,\phi(t'CC't) = e^{it'\mu}\,\phi(t'\Lambda t)$

when $\Lambda = CC'$. Conversely, any characteristic function of the form
$e^{it'\mu}\phi(t'\Lambda t)$ corresponding to a density corresponds to a random vector $X$
with the density (2).
The moments of $X$ with an elliptically contoured distribution can be
found from the characteristic function $e^{it'\mu}\phi(t'\Lambda t)$ or from the representation
$X = \mu + RCU$, where $C'\Lambda^{-1}C = I$. Note that

(31)  $\mathcal{E}R^2 = C(p)\int_0^\infty r^{p+1}g(r^2)\,dr = -2p\,\phi'(0),$

(32)  $\mathcal{E}R^4 = C(p)\int_0^\infty r^{p+3}g(r^2)\,dr = 4p(p+2)\,\phi''(0).$


Consider the higher-order moments of $Y = RU$. The odd-order moments
of $U$ are 0, and hence the odd-order moments of $Y$ are 0.
We have

(33)  $\mathcal{E}(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k) = 0.$

In fact, all moments of $X - \mu$ of odd order are 0.
Consider $\mathcal{E}U_iU_jU_kU_l$. Because $U'U = 1$,

(34)  $1 = \sum_{i,j=1}^{p}\mathcal{E}U_i^2U_j^2 = p\,\mathcal{E}U_1^4 + p(p-1)\,\mathcal{E}U_1^2U_2^2.$




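Integration of $\mathcal{E}\sin^4\Theta_1$ gives $\mathcal{E}U_1^4 = 3/[p(p+2)]$; then (34) implies
$\mathcal{E}U_1^2U_2^2 = 1/[p(p+2)]$. Hence $\mathcal{E}Y_i^4 = 3\,\mathcal{E}R^4/[p(p+2)]$ and $\mathcal{E}Y_i^2Y_j^2 =
\mathcal{E}R^4/[p(p+2)]$, $i \ne j$. Unless $i=j=k=l$ or $i=j\ne k=l$ or $i=k\ne j=l$ or
$i=l\ne j=k$, we have $\mathcal{E}U_iU_jU_kU_l = 0$. To summarize, $\mathcal{E}U_iU_jU_kU_l = (\delta_{ij}\delta_{kl} +
\delta_{ik}\delta_{jl} + \delta_{il}\delta_{jk})/[p(p+2)]$. The fourth-order moments of $X$ are

(35)  $\mathcal{E}(X_i-\mu_i)(X_j-\mu_j)(X_k-\mu_k)(X_l-\mu_l)$
      $= \dfrac{\mathcal{E}R^4}{p(p+2)}\,(\lambda_{ij}\lambda_{kl} + \lambda_{ik}\lambda_{jl} + \lambda_{il}\lambda_{jk})$
      $= \dfrac{\mathcal{E}R^4}{(\mathcal{E}R^2)^2}\,\dfrac{p}{p+2}\,(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}).$

The fourth cumulant of the $i$th component of $X$ standardized by its
standard deviation is

(36)  $\dfrac{\mathcal{E}(X_i-\mu_i)^4 - 3[\mathcal{E}(X_i-\mu_i)^2]^2}{[\mathcal{E}(X_i-\mu_i)^2]^2} = \dfrac{\dfrac{3\,\mathcal{E}R^4}{p(p+2)} - 3\left(\dfrac{\mathcal{E}R^2}{p}\right)^2}{\left(\dfrac{\mathcal{E}R^2}{p}\right)^2} = 3\kappa,$

say. This is known as the kurtosis. (Note that $\kappa$ is $\tfrac13\,\mathcal{E}(X_i-\mu_i)^4/
[\mathcal{E}(X_i-\mu_i)^2]^2 - 1$.) The standardized fourth cumulant is $3\kappa$ for every
component of $X$. The fourth cumulant of $X_i$, $X_j$, $X_k$, and $X_l$ is

(37)  $\kappa_{ijkl} = \mathcal{E}(X_i-\mu_i)(X_j-\mu_j)(X_k-\mu_k)(X_l-\mu_l) - (\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk})$
      $= \kappa(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}).$

For the normal distribution $\kappa = 0$. The fourth-order moments can be written

(38)  $\mathcal{E}(X_i-\mu_i)(X_j-\mu_j)(X_k-\mu_k)(X_l-\mu_l) = (1+\kappa)(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}).$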
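As a small worked illustration of (35) through (38), the kurtosis parameter can be computed from the radial moments as $\kappa = p\,\mathcal{E}R^4/[(p+2)(\mathcal{E}R^2)^2] - 1$, which is what equating (35) and (38) gives. The sketch below is an illustration only; the function names `kurtosis_parameter` and `fourth_moment` and the example matrix are hypothetical. It checks that the normal case, where $R^2$ is $\chi_p^2$ with $\mathcal{E}R^2 = p$ and $\mathcal{E}R^4 = p(p+2)$, gives $\kappa = 0$.

```python
import numpy as np

def kurtosis_parameter(p, ER2, ER4):
    """kappa implied by (35) and (38): p E R^4 / ((p + 2) (E R^2)^2) - 1."""
    return p * ER4 / ((p + 2) * ER2 ** 2) - 1.0

def fourth_moment(kappa, Sigma, i, j, k, l):
    """Central fourth-order moment from (38), with zero-based indices."""
    S = Sigma
    return (1.0 + kappa) * (S[i, j] * S[k, l] + S[i, k] * S[j, l] + S[i, l] * S[j, k])

p = 3
# Normal case: R^2 is chi-square with p degrees of freedom.
kappa_normal = kurtosis_parameter(p, ER2=p, ER4=p * (p + 2))

# Illustrative covariance matrix (made up for this example).
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
m4 = fourth_moment(kappa_normal, Sigma, 0, 0, 0, 0)   # 3 * sigma_11^2 when kappa = 0

print(kappa_normal, m4)
```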


More detail about elliptically contoured distributions can be found in Fang 
and Zhang (1990). 




The class of elliptically contoured distributions generalizes the normal 
distribution, introducing more flexibility; the kurtosis is not required to be 0. 
The typical "bell-shaped surface" of $|\Lambda|^{-\frac12}g[(x-\nu)'\Lambda^{-1}(x-\nu)]$ can be
more or less peaked than in the case of the normal distribution. In the next
subsection some examples are given.


2.7.5. Examples


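(1) The multivariate t-distribution. Suppose $Z \sim N(0, I_p)$, $ms^2 \overset{d}{=} \chi_m^2$, and $Z$
and $s^2$ are independent. Define $Y = (1/s)Z$. Then the density of $Y$ is

(39)  $\dfrac{\Gamma\!\left(\dfrac{m+p}{2}\right)}{\Gamma\!\left(\dfrac{m}{2}\right)m^{p/2}\pi^{p/2}}\left(1 + \dfrac{y'y}{m}\right)^{-\frac12(m+p)},$

and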
(40)  $R^2 = Y'Y \overset{d}{=} \dfrac{\chi_p^2}{\chi_m^2/m}.$


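If $X = \mu + CY$, the density of $X$ is

(41)  $\dfrac{\Gamma\!\left(\dfrac{m+p}{2}\right)}{\Gamma\!\left(\dfrac{m}{2}\right)m^{p/2}\pi^{p/2}|\Lambda|^{\frac12}}\left[1 + \dfrac{(x-\mu)'\Lambda^{-1}(x-\mu)}{m}\right]^{-\frac12(m+p)}.$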
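The construction of the multivariate $t$-distribution translates directly into a sampler. The following sketch assumes $\Lambda = CC'$ with $C$ taken as a Cholesky factor, and uses made-up values of $\mu$, $\Lambda$, and $m$; it compares the sample mean and covariance with $\mu$ and $[m/(m-2)]\Lambda$, the values asserted in Problem 2.68.

```python
import numpy as np

rng = np.random.default_rng(2)
p, m, n = 3, 10, 400_000

# Illustrative parameters (made up for this example).
mu = np.array([1.0, -2.0, 0.5])
Lam = np.array([[2.0, 0.5, 0.0],
                [0.5, 1.0, 0.3],
                [0.0, 0.3, 1.5]])
C = np.linalg.cholesky(Lam)                  # one choice of C with C C' = Lambda

Z = rng.standard_normal((n, p))              # Z ~ N(0, I_p)
s = np.sqrt(rng.chisquare(m, size=n) / m)    # m s^2 ~ chi-square with m d.f., independent of Z
Y = Z / s[:, None]                           # Y = (1/s) Z
X = mu + Y @ C.T                             # X = mu + C Y, applied row by row

sample_mean = X.mean(axis=0)
sample_cov = np.cov(X, rowvar=False)
expected_cov = m / (m - 2) * Lam

print(np.round(sample_mean, 3))
print(np.round(sample_cov, 3))
```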


(2) Contaminated normal. The contaminated normal distribution is a mixture
of two normal distributions with proportional covariance matrices and
the same mean vector. The density can be written

(42)  $(1-\varepsilon)\dfrac{1}{(2\pi)^{p/2}|\Lambda|^{\frac12}}\,e^{-\frac12(x-\mu)'\Lambda^{-1}(x-\mu)} + \varepsilon\,\dfrac{1}{(2\pi)^{p/2}|c\Lambda|^{\frac12}}\,e^{-\frac{1}{2c}(x-\mu)'\Lambda^{-1}(x-\mu)},$

where $c > 0$ and $0 < \varepsilon < 1$. Usually $\varepsilon$ is rather small and $c$ rather large.
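To see how contamination moves the kurtosis away from 0, one can work out the radial moments of (42). The sketch below rests on an assumption not stated in the text: that $X - \mu = \sqrt{W}\,CZ$ with $Z \sim N(0, I_p)$, $CC' = \Lambda$, and $W$ equal to 1 with probability $1-\varepsilon$ and to $c$ with probability $\varepsilon$, so that $R^2 = W\chi_p^2$, $\mathcal{E}R^2 = p(1-\varepsilon+\varepsilon c)$, and $\mathcal{E}R^4 = p(p+2)(1-\varepsilon+\varepsilon c^2)$. Substituting into $\kappa = p\,\mathcal{E}R^4/[(p+2)(\mathcal{E}R^2)^2] - 1$ gives a value that does not depend on $p$. The function name is hypothetical.

```python
def contaminated_normal_kappa(eps, c):
    """Kurtosis parameter kappa of the contaminated normal (42),
    under the scale-mixture representation assumed in the lead-in."""
    second = (1.0 - eps) + eps * c        # E W
    fourth = (1.0 - eps) + eps * c ** 2   # E W^2
    return fourth / second ** 2 - 1.0

kappa_clean = contaminated_normal_kappa(0.0, 9.0)   # no contamination: normal case
kappa_mixed = contaminated_normal_kappa(0.1, 9.0)   # 10% of mass with covariance 9 * Lambda

print(kappa_clean, kappa_mixed)
```

With a small $\varepsilon$ and a large $c$, as in the usual case described above, $\kappa$ is positive, reflecting heavier tails than the normal.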


(3) Mixtures of normal distributions. Let w(v) be a cumulative distribution 
function over 0 <v < oo. Then a mixture of normal densities is defined by 


(43)  $\int_0^\infty n(x \mid \mu, v\Lambda)\,dw(v),$




which is an elliptically contoured density. The random vector X with this 
density has a representation X =wZ, where Z ~ N(w, =) and w ~ w(w) are 
independent. 

Fang, Kotz, and Ng (1990) have discussed (43) and have given other 
examples of elliptically contoured distributions. 


PROBLEMS 


2.1. (Sec. 2.2) Let $f(x, y) = 1$, $0 \le x \le 1$, $0 \le y \le 1$,
     $= 0$, otherwise.

Find:

(a) $F(x, y)$.
(b) $F(x)$.
(c) $f(x)$.
(d) $f(x \mid y)$. [Note: $f(x_0 \mid y_0) = 0$ if $f(x_0, y_0) = 0$.]
(e) $\mathcal{E}X^nY^m$.
(f) Prove $X$ and $Y$ are independent.


2.2. (Sec. 2.2) Let $f(x, y) = 2$, $0 \le y \le x \le 1$,
     $= 0$, otherwise.

Find:

(a) $F(x, y)$.
(b) $F(x)$.
(c) $f(x)$.
(d) $G(y)$.
(e) $g(y)$.
(f) $f(x \mid y)$.
(g) $f(y \mid x)$.
(h) $\mathcal{E}X^nY^m$.
(i) Are $X$ and $Y$ independent?


2.3. (Sec. 2.2) Let $f(x, y) = C$ for $x^2 + y^2 \le k^2$ and 0 elsewhere. Prove $C =
1/(\pi k^2)$, $\mathcal{E}X = \mathcal{E}Y = 0$, $\mathcal{E}X^2 = \mathcal{E}Y^2 = k^2/4$, and $\mathcal{E}XY = 0$. Are $X$ and $Y$
independent?


2.4. (Sec. 2.2) Let $F(x_1, x_2)$ be the joint cdf of $X_1$, $X_2$, and let $F_i(x_i)$ be the
marginal cdf of $X_i$, $i = 1, 2$. Prove that if $F_i(x_i)$ is continuous, $i = 1, 2$, then
$F(x_1, x_2)$ is continuous.


2.5. (Sec. 2.2) Show that if the set $X_1,\dots,X_r$ is independent of the set
$X_{r+1},\dots,X_p$, then

$\mathcal{E}f(X_1,\dots,X_r)\,g(X_{r+1},\dots,X_p) = \mathcal{E}f(X_1,\dots,X_r)\;\mathcal{E}g(X_{r+1},\dots,X_p).$


2.6. (Sec. 2.3) Sketch the ellipses $f(x, y) = 0.06$, where $f(x, y)$ is the bivariate
normal density with

(a) $\mu_x = 1$, $\mu_y = 2$, $\sigma_x^2 = 1$, $\sigma_y^2 = 1$, $\rho_{xy} = 0$.
(b) $\mu_x = 0$, $\mu_y = 0$, $\sigma_x^2 = 1$, $\sigma_y^2 = 1$, $\rho_{xy} = 0$.
(c) $\mu_x = 0$, $\mu_y = 0$, $\sigma_x^2 = 1$, $\sigma_y^2 = 1$, $\rho_{xy} = 0.2$.
(d) $\mu_x = 0$, $\mu_y = 0$, $\sigma_x^2 = 1$, $\sigma_y^2 = 1$, $\rho_{xy} = 0.8$.
(e) $\mu_x = 0$, $\mu_y = 0$, $\sigma_x^2 = 4$, $\sigma_y^2 = 1$, $\rho_{xy} = 0.8$.


2.7. (Sec. 2.3) Find $b$ and $A$ so that the following densities can be written in the
form of (23). Also find $\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$, and $\rho_{xy}$.

(a) $\dfrac{1}{2\pi}\exp\{-\tfrac12[(x-1)^2 + (y-2)^2]\}$.

(b) $\dfrac{1}{2.4\pi}\exp\left(-\dfrac{x^2/4 - 1.6xy/2 + y^2}{0.72}\right)$.

(c) $\dfrac{1}{2\pi}\exp[-\tfrac12(x^2 + y^2 + 4x - 6y + 13)]$.

(d) $\dfrac{1}{2\pi}\exp[-\tfrac12(2x^2 + y^2 + 2xy - 22x - 14y + 65)]$.


2.8. (Sec. 2.3) For each matrix $A$ in Problem 2.7 find $C$ so that $C'AC = I$.


2.9. (Sec. 2.3) Let $b = 0$,

$A = \begin{pmatrix} 7 & 3 & 2 \\ 3 & 4 & 1 \\ 2 & 1 & 2 \end{pmatrix}.$

(a) Write the density (23).
(b) Find $\Sigma$.


2.10. (Sec. 2.3) Prove that the principal axes of (55) of Section 2.3 are along the 45°
and 135° lines with lengths $2\sqrt{c(1+\rho)}$ and $2\sqrt{c(1-\rho)}$, respectively, by
transforming according to $y_1 = (z_1 + z_2)/\sqrt2$, $y_2 = (z_1 - z_2)/\sqrt2$.


2.11. (Sec. 2.3) Suppose the scalar random variables $X_1,\dots,X_n$ are independent
and have a density which is a function only of $x_1^2 + \cdots + x_n^2$. Prove that the $X_i$
are normally distributed with mean 0 and common variance. Indicate the
mildest conditions on the density for your proof.


2.12. (Sec. 2.3) Show that if $\Pr\{X \ge 0, Y \ge 0\} = \alpha$ for the distribution

$N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right],$

then $\rho = \cos(1 - 2\alpha)\pi$. [Hint: Let $X = U$, $Y = \rho U + \sqrt{1-\rho^2}\,V$ and verify $\rho =
\cos 2\pi(\tfrac12 - \alpha)$ geometrically.]


2.13. (Sec. 2.3) Prove that if $\rho_{ij} = \rho$, $i \ne j$, $i, j = 1,\dots,p$, then $\rho \ge -1/(p-1)$.


2.14. (Sec. 2.3) Concentration ellipsoid. Let the density of the $p$-component $Y$ be
$f(y) = \Gamma(\tfrac12 p + 1)/[(p+2)\pi]^{\frac12 p}$ for $y'y \le p+2$ and 0 elsewhere. Then $\mathcal{E}Y = 0$
and $\mathcal{E}YY' = I$ (Problem 7.4). From this result prove that if the density of $X$ is
$g(x) = \sqrt{|A|}\,\Gamma(\tfrac12 p + 1)/[(p+2)\pi]^{\frac12 p}$ for $(x-\mu)'A(x-\mu) \le p+2$ and 0 elsewhere,
then $\mathcal{E}X = \mu$ and $\mathcal{E}(X-\mu)(X-\mu)' = A^{-1}$.


2.15. (Sec. 2.4) Show that when $X$ is normally distributed the components are
mutually independent if and only if the covariance matrix is diagonal.


2.16. (Sec. 2.4) Find necessary and sufficient conditions on $A$ so that $AY + \lambda$ has a
continuous cdf.


2.17. (Sec. 2.4) Which densities in Problem 2.7 define distributions in which $X$ and
$Y$ are independent?


2.18. (Sec. 2.4)

(a) Write the marginal density of $X$ for each case in Problem 2.6.
(b) Indicate the marginal distribution of $X$ for each case in Problem 2.7 by the
notation $N(a, b)$.
(c) Write the marginal density of $X_1$ and $X_2$ in Problem 2.9.


2.19. (Sec. 2.4) What is the distribution of $Z = X - Y$ when $X$ and $Y$ have each of
the densities in Problem 2.6?


2.20. (Sec. 2.4) What is the distribution of $X_1 + 2X_2 - 3X_3$ when $X_1, X_2, X_3$ have
the distribution defined in Problem 2.9?


2.21. (Sec. 2.4) Let $\mathbf{X} = (X_1, X_2)'$, where $X_1 = X$ and $X_2 = aX + b$ and $X$ has the
distribution $N(0, 1)$. Find the cdf of $\mathbf{X}$.


2.22. (Sec. 2.4) Let $X_1,\dots,X_N$ be independently distributed, each according to
$N(\mu, \sigma^2)$.

(a) What is the distribution of $X = (X_1,\dots,X_N)'$? Find the vector of means
and the covariance matrix.
(b) Using Theorem 2.4.4, find the marginal distribution of $\bar X = \sum X_i/N$.

2.23. (Sec. 2.4) Let $X_1,\dots,X_N$ be independently distributed with $X_i$ having
distribution $N(\beta + \gamma z_i, \sigma^2)$, where $z_i$ is a given number, $i = 1,\dots,N$, and $\sum_i z_i = 0$.

(a) Find the distribution of $(X_1,\dots,X_N)'$.
(b) Find the distribution of $\bar X$ and $g = \sum X_iz_i/\sum z_i^2$ for $\sum z_i^2 > 0$.


2.24. (Sec. 2.4) Let $(X_1, Y_1)'$, $(X_2, Y_2)'$, $(X_3, Y_3)'$ be independently distributed,
$(X_i, Y_i)'$ according to

$N\left[\begin{pmatrix} \mu \\ \nu \end{pmatrix}, \begin{pmatrix} \sigma_{xx} & \sigma_{xy} \\ \sigma_{yx} & \sigma_{yy} \end{pmatrix}\right], \qquad i = 1, 2, 3.$

(a) Find the distribution of the six variables.
(b) Find the distribution of $(\bar X, \bar Y)'$.


2.25. (Sec. 2.4) Let $X$ have a (singular) normal distribution with mean 0 and
covariance matrix

$\Sigma = \begin{pmatrix} 4 & 2 \\ 2 & 1 \end{pmatrix}.$

(a) Prove $\Sigma$ is of rank 1.
(b) Find $a$ so $X = a'Y$ and $Y$ has a nonsingular normal distribution, and give
the density of $Y$.


2.26. (Sec. 2.4) Let


(a) Find a vector $u \ne 0$ so that $\Sigma u = 0$. [Hint: Take cofactors of any column.]
(b) Show that any matrix of the form $G = (H\ u)$, where $H$ is $3 \times 2$, has the
property

$G'\Sigma G = \begin{pmatrix} H'\Sigma H & 0 \\ 0' & 0 \end{pmatrix}.$

(c) Using (a) and (b), find $B$ to satisfy (36).
(d) Find $B^{-1}$ and partition according to (39).
(e) Verify that $CC' = \Sigma$.


2.27. (Sec. 2.4) Prove that if the joint (marginal) distribution of $X_1$ and $X_2$ is
singular (that is, degenerate), then the joint distribution of $X_1$, $X_2$, and $X_3$ is
singular.


60 


2.28. 


2.29, 


2.30. 


2.31. 


2.32. 


2.33. 


2.34, 


2.35. 


2.36, 


THE MULTIVARIATE NORMAL DISTRIBUTION 


(Sec. 2.5) In each part of Problem 2.6, find the conditional distribution of $X$
given $Y = y$, find the conditional distribution of $Y$ given $X = x$, and plot each
regression line on the appropriate graph in Problem 2.6.


(Sec. 2.5) Let $\mu = 0$ and

$\Sigma = \begin{pmatrix} 1. & 0.80 & -0.40 \\ 0.80 & 1. & -0.56 \\ -0.40 & -0.56 & 1. \end{pmatrix}.$

(a) Find the conditional distribution of $X_1$ and $X_3$, given $X_2 = x_2$.
(b) What is the partial correlation between $X_1$ and $X_3$ given $X_2$?


(Sec. 2.5) In Problem 2.9, find the conditional distribution of $X_1$ and $X_2$ given
$X_3 = x_3$.


(Sec. 2.5) Verify (20) directly from Theorem 2.5.1. 


(Sec. 2.5) 


(a) Show that finding o to maximize the absolute value of the correlation 
between X, and aX is equivalent to maximizing (a/,0)* subject to 
w'2..@ constant. 

(b) Find & by maximizing (o(,,)* — ACa'.. @ —c), where c is a constant and 
A is a Lagrange multiplier. i 


x 
(Sec. 2.5) Inuariance of the multiple correlation coefficient. Prove that R Rad eacaa 
is an invariant characteristic of the multivariate normal distribution of X, and 
X® under the transformation x* =b,x,+c; for b,#Q and X°*=HX@+E 
for H nonsingular and that every function of 4;, oj Oy, pO, and Z,. that is 
invariant is a function of K,; 


a Pp 
(Sec. 2.5) Prove that 
= 1 |) oy 
1-R?, =— F k,f=qtl,...,p. 
PED ee Se | px, | Pri Pry q 


(Sec. 2.5) Find the multiple correlation coefficient between X, and (X,, X3) 
in Problem 2.29. 


(Sec. 2.5) Prove explicitly that if & is positive definite, 


\z| = [2a _ Bs beg ai . | Zool 


PROBLEMS 61 


2.37. 


238. 


239. 


2.40. 


241. 


2.42, 


2.43. 


245. 


(Sec. 2.5) Prove Hadamard's inequality

$|\Sigma| \le \prod_{i=1}^{p}\sigma_{ii}.$

[Hint: Using Problem 2.36, prove $|\Sigma| \le \sigma_{11}|\Sigma_{22}|$, where $\Sigma_{22}$ is $(p-1)\times
(p-1)$, and apply induction.]


(Sec. 2.5) Prove equality holds in Problem 2.37 if and only if & is diagonal. 


(Sec, 2.5) Prove Bj3.4 = 019.3/ 092.4 = Pi9.2%}.2/032 aNd By3.. = 013../033.. = 
P13.2 O}.2/94.21 where Oy-k ~ Finks 


(Sec. 2.5) Let (X,, X,) have the density # (x|0, 2) =f(x,, x,). Let the density 
of X, given X,=x, be f(x,|x,). Let the joint density of X,,X,,X, be 
F(X), X2)f(x4|x,). Find the covariance matrix of X,, X,,X, and the partial 
correlation between X, and X, for given X). 


(Sec. 2.5) Prove 1—Ri.3 = (1 — p},X1 — p},.3). [Hint Use the fact_that the 
variance of X,-in the conditional distribution given x, and x, is (1— Rj..3)0;,-] 


(Sec. 2.5) If $p = 2$, can there be a difference between the simple correlation
between $X_1$ and $X_2$ and the multiple correlation between $X_1$ and $X^{(2)} = X_2$?
Explain.


(Sec. 2.5) Prove 


Piegal ccak-i FR 


Gigttyk—l,ktl,oecy Dp 
TT Phkega yr ieyka 1, RAL coe Oke gtl ecko k+l ye0.5P 


i=1,...,qg, kK=qt+1,..--,p, where Oj94 Lack 1 kK+ ley p ™ 
Ojj-g+t,..k-1ek+l,..,p> J = i,k. [Hint: Prove this for the special case k=q +1 
by using Problem 2.56 with p,=q, po=1,p,3=p—q- 1] 


. (Sec. 2.5) Give a necessary and sufficient condition for Rj.941,,..,) = 0 in terms 


of Giq+io sey Oips 


(Sec. 2.5) Show 


1 ae ee po (i = p,)(1 = P? p~1-p) bee (1 = Pcigeual 


[ Hint: Use (19) and (27) successively.] 


62 


2.46. 


2.47. 


2.48. 


2.49, 


2.50. 


2.51. 


2.52. 


2.53, 


THE MULTIVARIATE NORMAL DISTRIBUTICN 
(Sec. 2.5) Show 


2 = 
Pijget Pp Bij g+i WiGsa pPyrgqtt oo, Pp 


(Sec. 2.5) Prove 


[ Hint: Apply Theorem A.3.2 of the Appendix to the cofactors used to calculate 
a") 


(Sec. 2.5) Show that for any joint distribution for which the expectations exist
and any function $h(x^{(2)})$ that

$\mathcal{E}[X_i - \mathcal{E}(X_i \mid X^{(2)})]\,h(X^{(2)}) = 0.$

[Hint: In the above take the expectation first with respect to $X_i$ conditional
on $X^{(2)}$.]


(Sec, 2.5) Show that for any function h(x) and any joint distribution of X, 
and X) for which the relevant expectations exist, &[X,—h(x@))? = &LX, - 
e( X)P + Sl g(X) — h(x), where g(x) = &X,\x@ is the conditional 
expectation of X;, given X@ =x. Hence g(X@) minimizes the mean squared 
error of prediction. | Hint: Use Problem 2.48.] 


(Sec. 2.5) Show that for any function A(x®)) and any joint distribution of X, 
and X®) for which the relevant expectations exist, the correlation between X; 


and A(X) is not greater than the correlation between X, and g(X), where 
g(x) = 8X, |x@, 


(Sec. 2.5) Show that for any vector function h(x) 


é[ x a h(x™)| [x@ =. x)]' — &[X -— 6x X@][ xO - exO| xO] 
is positive semidefinite. Note this generalizes Theorem 2.5.3 and Problem 2.49. 


(Sec. 25) Verify that £,,%3,) = —Wy'W,., where W = 7! is partitioned 
similarly to 2. 


(Sec. 2.5) Show 


roe Line — 2h 2 2x 
—EpZanbi2 TE Lake Ey' + Zz! 
eames + : =nbd -B 
9 =5 _p' 1-2 ); 


where B=2,,25,'. [Hint: Use Theorem A.3.3 of the Appendix and the fact 
that £~' is symmetric.] 


PROBLEMS 63 


2.54. 


2.55. 


2.56. 


2.57, 


2.58. 


Q 
Q 


(Sec. 2.5) Use Problem 2.53 to show that 


Es (x9 —¥ EG x@y yp (xO — ¥, Epix) +x Eze, 


(Sec. 2.5) Show 
E (XM x2, xO) = pO + EEF (xO — pw) 
oe ao -1 
+ (21 — £13233! L532 )(Xy -Inry X32) 
-[x@ — pO) —¥,. 35) (29) - p?))] ; 
(Sec. 2,5) Prove by matrix algebra that 


Sy Ib Fe 


En - uko)| 3" 3 |= 2a EotaEa 


Sega (See ta tay (nH tate es). 


(Sec. 2.5) Invariance of the partial correlation coefficient. Prove that py2.4,_..,p is 
invariant under the transformations x* = a,x, + bx + ¢,, a;> 0, i=1,2, xO* 
= Cx® +d, where x =(x3,...,x,)‘, and that any function of p and & that is 
invariant under these transformations is a function of py23,__ » 


(Sec. 2.5) Suppose X and X@ of q and p—g components, respectively, 
have the density 


_l4F oto 
(20 )” 
where 
Q = (x0) = py’ (xO — pO) + (2 = pA (x@ — p2) 
(x = pO)" Ag (x = pw) + (x2) = w)A g(x — py. 
Show that Q can be written as Q, + Q., where 


i= [ (x — po) + Aji 1o( x@ es ye) ]' 41, [( x ~ pO) + AGMA (x@ - @)] 


2 = (4% = w) "(Ay — An An Air) (x — pO). 


Show that the marginal density of X@) is 


a 4 
|Az2— An An Argl? - 40, 
(Qa yrr-? 


Show that the conditional density of X® given X¥@ = x@ is 
(27 )74 


(without using the Appendix). This problem is meant to furnish an alternative 
proof of Theorems 2.4.3 and 2.5.1. 


64 


THE MULTIVARIATE NORMAL ea 


2.59. (Sec. 2.6) Prove Lemma 2.6.2 in detail, ‘a 


2.60, 


2.61. 


2.62. 


2.64. 


(Sec. 2.6) Let $Y$ be distributed according to $N(0, \Sigma)$. Differentiating the
characteristic function, verify (25) and (26).


(Sec. 2.6) Verify (25) and (26) by using the transformation $X - \mu = CY$, where
$\Sigma = CC'$, and integrating the density of $Y$.


(Sec. 2.6) Let the density of (X,Y) be 
2n(xl0,1)n(yl0,1), O<ysx<om, OS -r<y<am, 
Q<-ys-x<c, O<xS-y em, 
0 otherwise. 


Show that X,Y,X + Y, X — Y each have a marginal normal distribut‘on. 


. (Sec. 2.6) Suppose X is distributed according to N(@, XZ). Let = = (6 ,.+4,0,) 
Prove 
Gio, GG} 
&(XX' @XX')=LOL+vecE(vecL)' +] - ; 
710, 7,9, 


where 


and ¢; is a column vector with 1 in the /th position and (’s elsewhere. 


Complex normal distribution. Let (X', Y’)’ have a normal distribution with mean 
vector (wy, py)’ and covariance matrix 


r -@® 
z-(5 7) 
where I’ is positive definite and & = —®' (skew symmetric). Then Z=X+i¥ 
is said to have a complex normal distribution with mean @=jry+ipy and 
covariance matrix &(Z —@)(Z — 0)* =P=@Q@+iR, where Z* = X’ — iY’. Note 
that P is Hermitian and positive definite. 


‘a) Show Q=2F and R=2®. 
(b) Show | PI? = |2E). [Hine [0 + iD| =(0 — i@).] 


PROBLEMS 65 
(ec) Show 


P-'=(9+RQ™'R)  —iQ'R(@+RQ™"'R) 


Note that the inverse of a Hermitian matrix ts Hermitian. 
(d) Show that the density of X and ¥ can be written 


P| Pl be OF Pe 0) 


. Complex normal (continued). \f Z has the complex normal distribution of 
Problem 2.64, show that W = AZ, where A is a nonsingular complex matrix, has 
the complex normal distribution with mean A® and covariance matrix ¢(W)= 
APA*. 


2.66. Show that the characteristic function of Z defined in Problem 2.64 is 
fei Riu*Z) zg Ru O-ue Pu 
where A(x + iy) =x. 
2.67. (Sec. 2.2) Show that [*,e7*'/? dx/¥2m is approximately (1 — e7?@/™)¥/2, 
[ Hint: The probability that (X,Y) falls in a square is approximately the 
probability that (X,Y) falls in an approximating circle [Pélya (1949)}] 


2.68. (Sec. 2.7) For the multivariate t-distribution with density (41) show that
$\mathcal{E}X = \mu$ and $\mathcal{C}(X) = [m/(m-2)]\Lambda$.


CHAPTER 3 


Estimation of the Mean Vector 
and the Covariance Matrix 


3.1. INTRODUCTION 


The multivariate normal distribution is specified completely by the mean
vector $\mu$ and the covariance matrix $\Sigma$. The first statistical problem is how to
estimate these parameters on the basis of a sample of observations. In
Section 3.2 it is shown that the maximum likelihood estimator of $\mu$ is the
sample mean; the maximum likelihood estimator of $\Sigma$ is proportional to the
matrix of sample variances and covariances. A sample variance is a sum of
squares of deviations of observations from the sample mean divided by one
less than the number of observations in the sample; a sample covariance is
similarly defined in terms of cross products. The sample covariance matrix is
an unbiased estimator of $\Sigma$.

The distribution of the sample mean vector is given in Section 3.3, and it is
shown how one can test the hypothesis that $\mu$ is a given vector when $\Sigma$ is
known. The case of $\Sigma$ unknown will be treated in Chapter 5.

Some theoretical properties of the sample mean are given in Section 3.4,
and the Bayes estimator of the population mean is derived for a normal a
priori distribution. In Section 3.5 the James-Stein estimator is introduced;
improvements over the sample mean for the mean squared error loss function
are discussed.

In Section 3.6 estimators of the mean vector and covariance matrix of
elliptically contoured distributions and the distributions of the estimators are
treated.


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc. 




3.2. THE MAXIMUM LIKELIHOOD ESTIMATORS OF THE MEAN 
VECTOR AND THE COVARIANCE MATRIX 


Given a sample of (vector) observations from a p-variate (nondegenerate)
normal distribution, we ask for estimators of the mean vector $\mu$ and the
covariance matrix $\Sigma$ of the distribution. We shall deduce the maximum
likelihood estimators.

It turns out that the method of maximum likelihood is very useful in 
various estimation and hypothesis testing problems concerning the multivari- 
ate normal distribution. The maximum likelihood estimators or modifications 
of them often have some optimum properties. In the particular case studied 
here, the estimators are asymptotically efficient [Cramér (1946), Sec. 33.3]. 

Suppose our sample of $N$ observations on $X$ distributed according to
$N(\mu, \Sigma)$ is $x_1,\dots,x_N$, where $N > p$. The likelihood function is

(1)  $L = \prod_{\alpha=1}^{N} n(x_\alpha \mid \mu, \Sigma) = \dfrac{1}{(2\pi)^{\frac12 pN}|\Sigma|^{\frac12 N}}\exp\left[-\tfrac12\sum_{\alpha=1}^{N}(x_\alpha-\mu)'\Sigma^{-1}(x_\alpha-\mu)\right].$

In the likelihood function the vectors $x_1,\dots,x_N$ are fixed at the sample
values and $L$ is a function of $\mu$ and $\Sigma$. To emphasize that these quantities
are variables (and not parameters) we shall denote them by $\mu^*$ and $\Sigma^*$. Then
the logarithm of the likelihood function is

(2)  $\log L = -\tfrac12 pN\log 2\pi - \tfrac12 N\log|\Sigma^*| - \tfrac12\sum_{\alpha=1}^{N}(x_\alpha-\mu^*)'\Sigma^{*-1}(x_\alpha-\mu^*).$

Since $\log L$ is an increasing function of $L$, its maximum is at the same point
in the space of $\mu^*, \Sigma^*$ as the maximum of $L$. The maximum likelihood
estimators of $\mu$ and $\Sigma$ are the vector $\mu^*$ and the positive definite matrix $\Sigma^*$
that maximize $\log L$. (It remains to be seen that the supremum of $\log L$ is
attained for a positive definite matrix $\Sigma^*$.)

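Let the sample mean vector be

(3)  $\bar x = \dfrac1N\sum_{\alpha=1}^{N}x_\alpha = \begin{pmatrix} \dfrac1N\sum_{\alpha=1}^{N}x_{1\alpha} \\ \vdots \\ \dfrac1N\sum_{\alpha=1}^{N}x_{p\alpha} \end{pmatrix} = \begin{pmatrix} \bar x_1 \\ \vdots \\ \bar x_p \end{pmatrix},$

where $x_\alpha = (x_{1\alpha},\dots,x_{p\alpha})'$ and $\bar x_i = \sum_{\alpha=1}^{N}x_{i\alpha}/N$, and let the matrix of sums
of squares and cross products of deviations about the mean be

(4)  $A = \sum_{\alpha=1}^{N}(x_\alpha-\bar x)(x_\alpha-\bar x)' = \left[\sum_{\alpha=1}^{N}(x_{i\alpha}-\bar x_i)(x_{j\alpha}-\bar x_j)\right], \qquad i, j = 1,\dots,p.$

It will be convenient to use the following lemma: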
It will be convenient to use the following lemma: 


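Lemma 3.2.1. Let $x_1,\dots,x_N$ be $N$ (p-component) vectors, and let $\bar x$ be
defined by (3). Then for any vector $b$

(5)  $\sum_{\alpha=1}^{N}(x_\alpha-b)(x_\alpha-b)' = \sum_{\alpha=1}^{N}(x_\alpha-\bar x)(x_\alpha-\bar x)' + N(\bar x-b)(\bar x-b)'.$

Proof.

(6)  $\sum_{\alpha=1}^{N}(x_\alpha-b)(x_\alpha-b)' = \sum_{\alpha=1}^{N}[(x_\alpha-\bar x)+(\bar x-b)][(x_\alpha-\bar x)+(\bar x-b)]'$
     $= \sum_{\alpha=1}^{N}[(x_\alpha-\bar x)(x_\alpha-\bar x)' + (x_\alpha-\bar x)(\bar x-b)' + (\bar x-b)(x_\alpha-\bar x)' + (\bar x-b)(\bar x-b)']$
     $= \sum_{\alpha=1}^{N}(x_\alpha-\bar x)(x_\alpha-\bar x)' + \left[\sum_{\alpha=1}^{N}(x_\alpha-\bar x)\right](\bar x-b)' + (\bar x-b)\sum_{\alpha=1}^{N}(x_\alpha-\bar x)' + N(\bar x-b)(\bar x-b)'.$

The second and third terms on the right-hand side are 0 because $\sum(x_\alpha-\bar x) =
\sum x_\alpha - N\bar x = 0$ by (3). ∎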
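The identity (5) is easy to confirm numerically. The sketch below is an illustration with arbitrary simulated data; observations are stored as rows, so that the sum of outer products becomes a matrix product. Variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 20, 3
x = rng.normal(size=(N, p))      # row alpha holds the observation x_alpha'
b = rng.normal(size=p)           # an arbitrary vector b

xbar = x.mean(axis=0)

# Left-hand side of (5): sum over alpha of (x_alpha - b)(x_alpha - b)'.
lhs = (x - b).T @ (x - b)

# Right-hand side of (5): A + N (xbar - b)(xbar - b)', with A as in (4).
A = (x - xbar).T @ (x - xbar)
rhs = A + N * np.outer(xbar - b, xbar - b)

print(np.max(np.abs(lhs - rhs)))
```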


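When we let $b = \mu^*$, we have

(7)  $\sum_{\alpha=1}^{N}(x_\alpha-\mu^*)(x_\alpha-\mu^*)' = \sum_{\alpha=1}^{N}(x_\alpha-\bar x)(x_\alpha-\bar x)' + N(\bar x-\mu^*)(\bar x-\mu^*)'$
     $= A + N(\bar x-\mu^*)(\bar x-\mu^*)'.$

Using this result and the properties of the trace of a matrix ($\operatorname{tr}CD = \sum c_{ij}d_{ji}
= \operatorname{tr}DC$), we have

(8)  $\sum_{\alpha=1}^{N}(x_\alpha-\mu^*)'\Sigma^{*-1}(x_\alpha-\mu^*) = \operatorname{tr}\sum_{\alpha=1}^{N}(x_\alpha-\mu^*)'\Sigma^{*-1}(x_\alpha-\mu^*)$
     $= \operatorname{tr}\sum_{\alpha=1}^{N}\Sigma^{*-1}(x_\alpha-\mu^*)(x_\alpha-\mu^*)'$
     $= \operatorname{tr}\Sigma^{*-1}A + \operatorname{tr}\Sigma^{*-1}N(\bar x-\mu^*)(\bar x-\mu^*)'$
     $= \operatorname{tr}\Sigma^{*-1}A + N(\bar x-\mu^*)'\Sigma^{*-1}(\bar x-\mu^*).$

Thus we can write (2) as

(9)  $\log L = -\tfrac12 pN\log(2\pi) - \tfrac12 N\log|\Sigma^*| - \tfrac12\operatorname{tr}\Sigma^{*-1}A - \tfrac12 N(\bar x-\mu^*)'\Sigma^{*-1}(\bar x-\mu^*).$

Since $\Sigma^*$ is positive definite, $\Sigma^{*-1}$ is positive definite, and $N(\bar x-\mu^*)'\Sigma^{*-1}(\bar x-\mu^*) \ge 0$
and is 0 if and only if $\mu^* = \bar x$. To maximize the second
and third terms of (9) we use the following lemma (which is also used in later
chapters):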


Lemma 3.2.2. If $D$ is positive definite of order $p$, the maximum of

(10)  $f(G) = -N\log|G| - \operatorname{tr}G^{-1}D$

with respect to positive definite matrices $G$ exists, occurs at $G = (1/N)D$, and
has the value

(11)  $f[(1/N)D] = pN\log N - N\log|D| - pN.$

Proof. Let $D = EE'$ and $E'G^{-1}E = H$. Then $G = EH^{-1}E'$, and $|G| = |E|\cdot
|H^{-1}|\cdot|E'| = |H^{-1}|\cdot|EE'| = |D|/|H|$, and $\operatorname{tr}G^{-1}D = \operatorname{tr}G^{-1}EE' =
\operatorname{tr}E'G^{-1}E = \operatorname{tr}H$. Then the function to be maximized (with respect to positive
definite $H$) is

(12)  $f = -N\log|D| + N\log|H| - \operatorname{tr}H.$

Let $H = TT'$, where $T$ is lower triangular (Corollary A.1.7). Then the
maximum of

(13)  $f = -N\log|D| + N\log|T|^2 - \operatorname{tr}TT'$
      $= -N\log|D| + \sum_{i=1}^{p}(N\log t_{ii}^2 - t_{ii}^2) - \sum_{i>j}t_{ij}^2$

occurs at $t_{ii}^2 = N$, $t_{ij} = 0$, $i \ne j$; that is, at $H = NI$. Then $G = (1/N)EE' =
(1/N)D$. ∎
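A quick numerical sanity check of Lemma 3.2.2 is sketched below. It is not a proof: it evaluates $f$ at $G = (1/N)D$, compares the value with the closed form (11), and confirms that a batch of randomly generated positive definite matrices never gives a larger value. The matrix $D$, the trial matrices, and the helper name `f` are made up for the example.

```python
import numpy as np

def f(G, D, N):
    """Objective (10): -N log|G| - tr(G^{-1} D), for positive definite G."""
    sign, logdet = np.linalg.slogdet(G)
    if sign <= 0:
        raise ValueError("G must be positive definite")
    return -N * logdet - np.trace(np.linalg.solve(G, D))

rng = np.random.default_rng(4)
p, N = 3, 12

E = rng.normal(size=(p, p))
D = E @ E.T + np.eye(p)                  # an arbitrary positive definite D

G_hat = D / N                            # maximizer claimed by the lemma
f_max = f(G_hat, D, N)
closed_form = p * N * np.log(N) - N * np.linalg.slogdet(D)[1] - p * N   # value (11)

# Evaluate f at many other positive definite matrices.
trial_values = []
for _ in range(200):
    M = rng.normal(size=(p, p))
    G = M @ M.T + 0.1 * np.eye(p)
    trial_values.append(f(G, D, N))

print(f_max, closed_form, max(trial_values))
```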


Theorem 3.2.1. If $x_1,\dots,x_N$ constitute a sample from $N(\mu, \Sigma)$ with $p < N$,
the maximum likelihood estimators of $\mu$ and $\Sigma$ are $\hat\mu = \bar x = (1/N)\sum_{\alpha=1}^{N}x_\alpha$ and
$\hat\Sigma = (1/N)\sum_{\alpha=1}^{N}(x_\alpha-\bar x)(x_\alpha-\bar x)'$, respectively.


Other methods of deriving the maximum likelihood estimators have been
discussed by Anderson and Olkin (1985). See Problems 3.4, 3.8, and 3.12.

Computation of the estimate $\hat\Sigma$ is made easier by the specialization of
Lemma 3.2.1 ($b = 0$)

(14)  $\sum_{\alpha=1}^{N}(x_\alpha-\bar x)(x_\alpha-\bar x)' = \sum_{\alpha=1}^{N}x_\alpha x_\alpha' - N\bar x\bar x'.$

An element of $\sum_{\alpha=1}^{N}x_\alpha x_\alpha'$ is computed as $\sum_{\alpha=1}^{N}x_{i\alpha}x_{j\alpha}$, and an element of
$N\bar x\bar x'$ is computed as $N\bar x_i\bar x_j$ or $(\sum_{\alpha=1}^{N}x_{i\alpha})(\sum_{\alpha=1}^{N}x_{j\alpha})/N$. It should be noted
that if $N > p$, the probability is 1 of drawing a sample so that (14) is positive
definite; see Problem 3.17.
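The estimators of Theorem 3.2.1, computed through the shortcut (14), can be sketched as follows. This is a minimal illustration with simulated data; the function name `mle_normal` is hypothetical, and observations are stored as rows. The result is compared with NumPy's covariance routine using the divisor $N$ (`bias=True`), which is the maximum likelihood form rather than the unbiased one.

```python
import numpy as np

def mle_normal(x):
    """Maximum likelihood estimates (mu_hat, Sigma_hat) from an N x p data array,
    using A = sum x x' - N xbar xbar' as in (14)."""
    N = x.shape[0]
    xbar = x.sum(axis=0) / N
    A = x.T @ x - N * np.outer(xbar, xbar)
    return xbar, A / N

rng = np.random.default_rng(5)
N, p = 50, 3
x = rng.normal(size=(N, p)) @ np.array([[1.0, 0.0, 0.0],
                                        [0.5, 1.0, 0.0],
                                        [0.2, 0.3, 1.0]])

mu_hat, Sigma_hat = mle_normal(x)
reference = np.cov(x, rowvar=False, bias=True)   # divisor N, for comparison

print(np.round(mu_hat, 3))
print(np.round(Sigma_hat, 3))
```

The shortcut (14) is convenient for hand computation; in floating point, subtracting $N\bar x\bar x'$ from $\sum x_\alpha x_\alpha'$ can lose accuracy when the means are large relative to the spread, in which case centering the data first is the safer route.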

The covariance matrix can be written in terms of the variances or standard 
deviations and correlation coefficients. These are uniquely defined by the 
variances and covariances. We assert that the maximum likelihood estimators 
of functions of the parameters are those functions of the maximum likelihood 
estimators of the parameters. 


Lemma 3.2.3. Let $f(\theta)$ be a real-valued function defined on a set $S$, and let
$\phi$ be a single-valued function, with a single-valued inverse, on $S$ to a set $S^*$; that
is, to each $\theta \in S$ there corresponds a unique $\theta^* \in S^*$, and, conversely, to each
$\theta^* \in S^*$ there corresponds a unique $\theta \in S$. Let

(15)  $g(\theta^*) = f[\phi^{-1}(\theta^*)].$

Then if $f(\theta)$ attains a maximum at $\theta = \theta_0$, $g(\theta^*)$ attains a maximum at
$\theta^* = \theta_0^* = \phi(\theta_0)$. If the maximum of $f(\theta)$ at $\theta_0$ is unique, so is the maximum
of $g(\theta^*)$ at $\theta_0^*$.


Proof. By hypothesis $f(\theta_0) \ge f(\theta)$ for all $\theta \in S$. Then for any $\theta^* \in S^*$

(16)  $g(\theta^*) = f[\phi^{-1}(\theta^*)] = f(\theta) \le f(\theta_0) = g[\phi(\theta_0)] = g(\theta_0^*).$

Thus $g(\theta^*)$ attains a maximum at $\theta_0^*$. If the maximum of $f(\theta)$ at $\theta_0$ is
unique, there is strict inequality above for $\theta \ne \theta_0$, and the maximum of $g(\theta^*)$
is unique. ∎




We have the following corollary: 


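Corollary 3.2.1. If on the basis of a given sample $\hat\theta_1,\dots,\hat\theta_m$ are maximum
likelihood estimators of the parameters $\theta_1,\dots,\theta_m$ of a distribution, then
$\phi_1(\hat\theta_1,\dots,\hat\theta_m),\dots,\phi_m(\hat\theta_1,\dots,\hat\theta_m)$ are maximum likelihood estimators of
$\phi_1(\theta_1,\dots,\theta_m),\dots,\phi_m(\theta_1,\dots,\theta_m)$ if the transformation from $\theta_1,\dots,\theta_m$ to
$\phi_1,\dots,\phi_m$ is one-to-one.† If the estimators of $\theta_1,\dots,\theta_m$ are unique, then the
estimators of $\phi_1,\dots,\phi_m$ are unique.


Corollary 3.2.2. If $x_1,\dots,x_N$ constitutes a sample from $N(\mu, \Sigma)$, where
$\sigma_{ij} = \sigma_i\sigma_j\rho_{ij}$ ($\rho_{ii} = 1$), then the maximum likelihood estimator of $\mu$ is $\hat\mu = \bar x =
(1/N)\sum_\alpha x_\alpha$; the maximum likelihood estimator of $\sigma_i^2$ is $\hat\sigma_i^2 = (1/N)\sum_\alpha(x_{i\alpha} -
\bar x_i)^2 = (1/N)(\sum_\alpha x_{i\alpha}^2 - N\bar x_i^2)$, where $x_{i\alpha}$ is the $i$th component of $x_\alpha$ and $\bar x_i$ is the
$i$th component of $\bar x$; and the maximum likelihood estimator of $\rho_{ij}$ is

(17)  $\hat\rho_{ij} = \dfrac{\sum_\alpha(x_{i\alpha}-\bar x_i)(x_{j\alpha}-\bar x_j)}{\sqrt{\sum_\alpha(x_{i\alpha}-\bar x_i)^2}\,\sqrt{\sum_\alpha(x_{j\alpha}-\bar x_j)^2}} = \dfrac{\sum_\alpha x_{i\alpha}x_{j\alpha} - N\bar x_i\bar x_j}{\sqrt{\sum_\alpha x_{i\alpha}^2 - N\bar x_i^2}\,\sqrt{\sum_\alpha x_{j\alpha}^2 - N\bar x_j^2}}.$


Proof. The set of parameters $\mu_i = \mu_i$, $\sigma_i^2 = \sigma_{ii}$, and $\rho_{ij} = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}}$ is a
one-to-one transform of the set of parameters $\mu_i$ and $\sigma_{ij}$. Therefore, by
Corollary 3.2.1 the estimator of $\mu_i$ is $\hat\mu_i$, of $\sigma_i^2$ is $\hat\sigma_{ii}$, and of $\rho_{ij}$ is

(18)  $\hat\rho_{ij} = \dfrac{\hat\sigma_{ij}}{\sqrt{\hat\sigma_{ii}\hat\sigma_{jj}}}.$ ∎


Pearson (1896) gave a justification for this estimator of $\rho_{ij}$, and (17) is
sometimes called the Pearson correlation coefficient. It is also called the
simple correlation coefficient. It is usually denoted by $r_{ij}$.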
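The estimator (17) can be computed directly from the matrix $A$ of (4), since $\hat\rho_{ij} = a_{ij}/\sqrt{a_{ii}a_{jj}}$ (the factor $1/N$ cancels). The sketch below uses simulated data and a hypothetical function name, `pearson_from_A`; it compares the result with NumPy's `corrcoef` and with the cosine of the angle between two centered data columns.

```python
import numpy as np

def pearson_from_A(x):
    """Matrix of r_ij = a_ij / sqrt(a_ii a_jj), with A the matrix of sums of
    squares and cross products of deviations about the mean; x is N x p."""
    d = x - x.mean(axis=0)
    A = d.T @ d
    s = np.sqrt(np.diag(A))
    return A / np.outer(s, s)

rng = np.random.default_rng(6)
N, p = 40, 3
x = rng.normal(size=(N, p))
x[:, 1] += 0.7 * x[:, 0]          # induce some correlation between the first two variables

r = pearson_from_A(x)
reference = np.corrcoef(x, rowvar=False)

# Cosine of the angle between the centered vectors of variables 1 and 2.
d0 = x[:, 0] - x[:, 0].mean()
d1 = x[:, 1] - x[:, 1].mean()
cosine = d0 @ d1 / np.sqrt((d0 @ d0) * (d1 @ d1))

print(np.round(r, 3))
```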


† The assumption that the transformation is one-to-one is made so that the set $\phi_1,\dots,\phi_m$
uniquely defines the likelihood. An alternative in case $\theta^* = \phi(\theta)$ does not have a unique inverse
is to define $S(\theta^*) = \{\theta : \phi(\theta) = \theta^*\}$ and $g(\theta^*) = \sup\{f(\theta) \mid \theta \in S(\theta^*)\}$, which is considered the
"induced likelihood" when $f(\theta)$ is the likelihood function. Then $\hat\theta^* = \phi(\hat\theta)$ maximizes $g(\theta^*)$,
for $g(\theta^*) = \sup\{f(\theta) \mid \theta \in S(\theta^*)\} \le \sup\{f(\theta) \mid \theta \in S\} = f(\hat\theta) = g(\hat\theta^*)$ for all $\theta^* \in S^*$. [See, e.g.,
Zehna (1966).]


Figure 3.1


A convenient geometrical interpretation of this sample $(x_1, x_2,\dots,x_N) = X$
is in terms of the rows of $X$. Let

(19)  $X = \begin{pmatrix} x_{11} & \cdots & x_{1N} \\ \vdots & & \vdots \\ x_{p1} & \cdots & x_{pN} \end{pmatrix} = \begin{pmatrix} u_1' \\ \vdots \\ u_p' \end{pmatrix};$

that is, $u_i'$ is the $i$th row of $X$. The vector $u_i$ can be considered as a vector in
an $N$-dimensional space with the $\alpha$th coordinate of one endpoint being $x_{i\alpha}$
and the other endpoint at the origin. Thus the sample is represented by $p$
vectors in $N$-dimensional Euclidean space. By definition of the Euclidean
metric, the squared length of $u_i$ (that is, the squared distance of one
endpoint from the other) is $u_i'u_i = \sum_{\alpha=1}^{N}x_{i\alpha}^2$.

Now let us show that the cosine of the angle between $u_i$ and $u_j$ is
$u_i'u_j/\sqrt{u_i'u_i\,u_j'u_j} = \sum_{\alpha=1}^{N}x_{i\alpha}x_{j\alpha}/\sqrt{\sum_{\alpha=1}^{N}x_{i\alpha}^2\sum_{\alpha=1}^{N}x_{j\alpha}^2}$. Choose the scalar $d$ so
the vector $du_j$ is orthogonal to $u_i - du_j$; that is, $0 = du_j'(u_i - du_j) = d(u_j'u_i -
du_j'u_j)$. Therefore, $d = u_j'u_i/u_j'u_j$. We decompose $u_i$ into $u_i - du_j$ and $du_j$
[$u_i = (u_i - du_j) + du_j$] as indicated in Figure 3.1. The absolute value of the
cosine of the angle between $u_i$ and $u_j$ is the length of $du_j$ divided by the
length of $u_i$; that is, it is $\sqrt{du_j'(du_j)/u_i'u_i} = \sqrt{du_j'u_jd/u_i'u_i}$; the cosine is
$u_i'u_j/\sqrt{u_i'u_i\,u_j'u_j}$. This proves the desired result.


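To give a geometric interpretation of $a_{ii}$ and $a_{ij}/\sqrt{a_{ii}a_{jj}}$, we introduce
the equiangular line, which is the line going through the origin and the point
(1,1,...,1). See Figure 3.2. The projection of $\mathbf{u}_i$ on the vector $\boldsymbol{\varepsilon} = (1,1,\dots,1)'$
is $(\boldsymbol{\varepsilon}'\mathbf{u}_i/\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon})\boldsymbol{\varepsilon} = \bigl(\sum_\alpha x_{i\alpha}/\sum_\alpha 1\bigr)\boldsymbol{\varepsilon} = \bar x_i\boldsymbol{\varepsilon} = (\bar x_i, \bar x_i,\dots,\bar x_i)'$. Then we decompose
$\mathbf{u}_i$ into $\bar x_i\boldsymbol{\varepsilon}$, the projection on the equiangular line, and $\mathbf{u}_i - \bar x_i\boldsymbol{\varepsilon}$, the
projection of $\mathbf{u}_i$ on the plane perpendicular to the equiangular line. The
squared length of $\mathbf{u}_i - \bar x_i\boldsymbol{\varepsilon}$ is $(\mathbf{u}_i - \bar x_i\boldsymbol{\varepsilon})'(\mathbf{u}_i - \bar x_i\boldsymbol{\varepsilon}) = \sum_\alpha (x_{i\alpha} - \bar x_i)^2$; this is
$N\hat\sigma_{ii} = a_{ii}$. Translate $\mathbf{u}_i - \bar x_i\boldsymbol{\varepsilon}$ and $\mathbf{u}_j - \bar x_j\boldsymbol{\varepsilon}$ so that each vector has an
endpoint at the origin; the $\alpha$th coordinate of the first vector is $x_{i\alpha} - \bar x_i$, and of
the second is $x_{j\alpha} - \bar x_j$. The cosine of the angle between these two vectors is

(20)
$$
\frac{(\mathbf{u}_i - \bar x_i\boldsymbol{\varepsilon})'(\mathbf{u}_j - \bar x_j\boldsymbol{\varepsilon})}
     {\sqrt{(\mathbf{u}_i - \bar x_i\boldsymbol{\varepsilon})'(\mathbf{u}_i - \bar x_i\boldsymbol{\varepsilon})\,(\mathbf{u}_j - \bar x_j\boldsymbol{\varepsilon})'(\mathbf{u}_j - \bar x_j\boldsymbol{\varepsilon})}}
= \frac{\sum_{\alpha=1}^N (x_{i\alpha} - \bar x_i)(x_{j\alpha} - \bar x_j)}
       {\sqrt{\sum_{\alpha=1}^N (x_{i\alpha} - \bar x_i)^2 \sum_{\alpha=1}^N (x_{j\alpha} - \bar x_j)^2}} .
$$

Figure 3.2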
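The identity in (20) can be checked numerically. The following is a minimal sketch in plain Python, assuming a small made-up data set; the numbers and the helper names `centered` and `cosine` are illustrative and do not come from the text. It compares the cosine of the angle between the centered rows with the raw-sum form of (17).

```python
from math import sqrt

def centered(u):
    """Subtract the mean of u from each coordinate: u - ubar * (1, ..., 1)'."""
    ubar = sum(u) / len(u)
    return [x - ubar for x in u]

def cosine(u, v):
    """Cosine of the angle between u and v: u'v / sqrt(u'u * v'v)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / sqrt(sum(a * a for a in u) * sum(b * b for b in v))

# Two rows of a hypothetical 2 x 5 data matrix X (illustrative values only).
u_i = [2.0, 4.0, 6.0, 8.0, 5.0]
u_j = [1.0, 3.0, 2.0, 7.0, 7.0]
N = len(u_i)

# Cosine of the angle between the projections perpendicular to the equiangular line.
cos_centered = cosine(centered(u_i), centered(u_j))

# Second form of (17): raw sums of squares and products.
xbar_i, xbar_j = sum(u_i) / N, sum(u_j) / N
num = sum(a * b for a, b in zip(u_i, u_j)) - N * xbar_i * xbar_j
den = sqrt((sum(a * a for a in u_i) - N * xbar_i ** 2)
           * (sum(b * b for b in u_j) - N * xbar_j ** 2))
r_ij = num / den

print(cos_centered, r_ij)
```

Both numbers agree up to rounding, which is the statement that the correlation coefficient is the cosine of the angle between the two centered vectors.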


As an example of the calculations consider the data in Table 3.1 and
graphed in Figure 3.3, taken from Student (1908). The measurement $x_{11} = 1.9$
on the first patient is the increase in the number of hours of sleep due to the
use of the sedative A, $x_{21} = 0.7$ is the increase in the number of hours due to
sedative B, and so on.

Table 3.1. Increase in Sleep

Patient    Drug A ($x_1$)    Drug B ($x_2$)
   1            1.9              0.7
   2            0.8             -1.6
   3            1.1             -0.2
   4            0.1             -1.2
   5           -0.1             -0.1
   6            4.4              3.4
   7            5.5              3.7
   8            1.6              0.8
   9            4.6              0.0
  10            3.4              2.0

Figure 3.3. Increase in sleep.

Assuming that each pair (i.e., each row in the table) is
an observation from $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, we find that

$$
\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}} = \begin{pmatrix} 2.33 \\ 0.75 \end{pmatrix}, \qquad
\hat{\boldsymbol{\Sigma}} = \begin{pmatrix} 3.61 & 2.56 \\ 2.56 & 2.88 \end{pmatrix}, \qquad
\mathbf{S} = \begin{pmatrix} 4.01 & 2.85 \\ 2.85 & 3.20 \end{pmatrix},
$$

and $\hat\rho_{12} = r_{12} = 0.7952$. ($\mathbf{S}$ will be defined later.)
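The figures quoted for the sleep data can be reproduced directly from Table 3.1. The sketch below uses only the Python standard library; the variable and function names (`drug_a`, `cross_product_sum`, and so on) are illustrative choices, not notation from the text.

```python
from math import sqrt

# Table 3.1 (Student, 1908): increase in hours of sleep for each patient.
drug_a = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]
drug_b = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]

def mean(xs):
    return sum(xs) / len(xs)

def cross_product_sum(xs, ys):
    """a_ij = sum over alpha of (x_i,alpha - xbar_i)(x_j,alpha - xbar_j)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

N = len(drug_a)
xbar = (mean(drug_a), mean(drug_b))
a11 = cross_product_sum(drug_a, drug_a)
a12 = cross_product_sum(drug_a, drug_b)
a22 = cross_product_sum(drug_b, drug_b)

# Maximum likelihood estimator (divisor N) and sample covariance (divisor N - 1).
sigma_hat = [[a11 / N, a12 / N], [a12 / N, a22 / N]]
s = [[a11 / (N - 1), a12 / (N - 1)], [a12 / (N - 1), a22 / (N - 1)]]

# Correlation coefficient, equation (17).
r12 = a12 / sqrt(a11 * a22)

print(xbar, sigma_hat, s, r12)
```

Rounded to two decimals (four for the correlation), the printed values match the estimates given above.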


3.3. THE DISTRIBUTION OF THE SAMPLE MEAN VECTOR; 
INFERENCE CONCERNING THE MEAN WHEN THE COVARIANCE 
MATRIX IS KNOWN 


3.3.1. Distribution Theory 


In the univariate case the mean of a sample is distributed normally and
independently of the sample variance. Similarly, the sample mean $\bar{\mathbf{X}}$ defined
in Section 3.2 is distributed normally and independently of $\hat{\boldsymbol{\Sigma}}$.




To prove this result we shall make a transformation of the set of observa- 
tion vectors. Because this kind of transformation is used several times in this 
book, we first prove a more general theorem. 


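Theorem 3.3.1. Suppose $\mathbf{X}_1,\dots,\mathbf{X}_N$ are independent, where $\mathbf{X}_\alpha$ is
distributed according to $N(\boldsymbol{\mu}_\alpha, \boldsymbol{\Sigma})$. Let $\mathbf{C} = (c_{\alpha\beta})$ be an $N \times N$ orthogonal matrix.
Then $\mathbf{Y}_\alpha = \sum_{\beta=1}^N c_{\alpha\beta}\mathbf{X}_\beta$ is distributed according to $N(\boldsymbol{\nu}_\alpha, \boldsymbol{\Sigma})$, where
$\boldsymbol{\nu}_\alpha = \sum_{\beta=1}^N c_{\alpha\beta}\boldsymbol{\mu}_\beta$, $\alpha = 1,\dots,N$, and $\mathbf{Y}_1,\dots,\mathbf{Y}_N$ are independent.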


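Proof. The set of vectors $\mathbf{Y}_1,\dots,\mathbf{Y}_N$ have a joint normal distribution,
because the entire set of components is a set of linear combinations of the
components of $\mathbf{X}_1,\dots,\mathbf{X}_N$, which have a joint normal distribution. The
expected value of $\mathbf{Y}_\alpha$ is

(1)
$$
\mathcal{E}\mathbf{Y}_\alpha = \mathcal{E}\sum_{\beta=1}^N c_{\alpha\beta}\mathbf{X}_\beta
= \sum_{\beta=1}^N c_{\alpha\beta}\,\mathcal{E}\mathbf{X}_\beta
= \sum_{\beta=1}^N c_{\alpha\beta}\boldsymbol{\mu}_\beta = \boldsymbol{\nu}_\alpha .
$$

The covariance matrix between $\mathbf{Y}_\alpha$ and $\mathbf{Y}_\gamma$ is

(2)
$$
\begin{aligned}
\mathcal{C}(\mathbf{Y}_\alpha, \mathbf{Y}_\gamma')
&= \mathcal{E}(\mathbf{Y}_\alpha - \boldsymbol{\nu}_\alpha)(\mathbf{Y}_\gamma - \boldsymbol{\nu}_\gamma)' \\
&= \mathcal{E}\Bigl[\sum_{\beta=1}^N c_{\alpha\beta}(\mathbf{X}_\beta - \boldsymbol{\mu}_\beta)\Bigr]
   \Bigl[\sum_{\varepsilon=1}^N c_{\gamma\varepsilon}(\mathbf{X}_\varepsilon - \boldsymbol{\mu}_\varepsilon)'\Bigr] \\
&= \sum_{\beta,\varepsilon=1}^N c_{\alpha\beta}c_{\gamma\varepsilon}\,
   \mathcal{E}(\mathbf{X}_\beta - \boldsymbol{\mu}_\beta)(\mathbf{X}_\varepsilon - \boldsymbol{\mu}_\varepsilon)' \\
&= \sum_{\beta,\varepsilon=1}^N c_{\alpha\beta}c_{\gamma\varepsilon}\,\delta_{\beta\varepsilon}\boldsymbol{\Sigma}
 = \sum_{\beta=1}^N c_{\alpha\beta}c_{\gamma\beta}\boldsymbol{\Sigma}
 = \delta_{\alpha\gamma}\boldsymbol{\Sigma},
\end{aligned}
$$

where $\delta_{\alpha\gamma}$ is the Kronecker delta (= 1 if $\alpha = \gamma$ and = 0 if $\alpha \ne \gamma$).
This shows that $\mathbf{Y}_\alpha$ is independent of $\mathbf{Y}_\gamma$, $\alpha \ne \gamma$, and $\mathbf{Y}_\alpha$ has the covariance
matrix $\boldsymbol{\Sigma}$. ∎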


We also use the following general lemma: 


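Lemma 3.3.1. If $\mathbf{C} = (c_{\alpha\beta})$ is orthogonal, then $\sum_{\alpha=1}^N \mathbf{x}_\alpha\mathbf{x}_\alpha' = \sum_{\alpha=1}^N \mathbf{y}_\alpha\mathbf{y}_\alpha'$,
where $\mathbf{y}_\alpha = \sum_{\beta=1}^N c_{\alpha\beta}\mathbf{x}_\beta$, $\alpha = 1,\dots,N$.

Proof.

(3)
$$
\begin{aligned}
\sum_{\alpha=1}^N \mathbf{y}_\alpha\mathbf{y}_\alpha'
&= \sum_\alpha \sum_\beta c_{\alpha\beta}\mathbf{x}_\beta \sum_\gamma c_{\alpha\gamma}\mathbf{x}_\gamma' \\
&= \sum_{\beta,\gamma}\Bigl(\sum_\alpha c_{\alpha\beta}c_{\alpha\gamma}\Bigr)\mathbf{x}_\beta\mathbf{x}_\gamma' \\
&= \sum_{\beta,\gamma}\delta_{\beta\gamma}\,\mathbf{x}_\beta\mathbf{x}_\gamma' \\
&= \sum_\beta \mathbf{x}_\beta\mathbf{x}_\beta' . \qquad\blacksquare
\end{aligned}
$$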


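Let $\mathbf{X}_1,\dots,\mathbf{X}_N$ be independent, each distributed according to $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
There exists an $N \times N$ orthogonal matrix $\mathbf{B} = (b_{\alpha\beta})$ with the last row

(4)
$$
(1/\sqrt{N},\dots,1/\sqrt{N}).
$$

(See Lemma A.4.2.) This transformation is a rotation in the N-dimensional
space described in Section 3.2 with the equiangular line going into the Nth
coordinate axis. Let $\mathbf{A} = N\hat{\boldsymbol{\Sigma}}$, defined in Section 3.2, and let

(5)
$$
\mathbf{Z}_\alpha = \sum_{\beta=1}^N b_{\alpha\beta}\mathbf{X}_\beta .
$$

Then

(6)
$$
\mathbf{Z}_N = \sum_{\beta=1}^N b_{N\beta}\mathbf{X}_\beta = \sum_{\beta=1}^N \frac{1}{\sqrt{N}}\mathbf{X}_\beta = \sqrt{N}\,\bar{\mathbf{X}} .
$$

By Lemma 3.3.1 we have

(7)
$$
\mathbf{A} = \sum_{\alpha=1}^N \mathbf{X}_\alpha\mathbf{X}_\alpha' - N\bar{\mathbf{X}}\bar{\mathbf{X}}'
= \sum_{\alpha=1}^N \mathbf{Z}_\alpha\mathbf{Z}_\alpha' - \mathbf{Z}_N\mathbf{Z}_N'
= \sum_{\alpha=1}^{N-1} \mathbf{Z}_\alpha\mathbf{Z}_\alpha' .
$$

Since $\mathbf{Z}_N$ is independent of $\mathbf{Z}_1,\dots,\mathbf{Z}_{N-1}$, the mean vector $\bar{\mathbf{X}}$ is independent
of $\mathbf{A}$. Since

(8)
$$
\mathcal{E}\mathbf{Z}_N = \sum_{\beta=1}^N b_{N\beta}\,\mathcal{E}\mathbf{X}_\beta = \sum_{\beta=1}^N \frac{1}{\sqrt{N}}\boldsymbol{\mu} = \sqrt{N}\boldsymbol{\mu},
$$

$\mathbf{Z}_N$ is distributed according to $N(\sqrt{N}\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and $\bar{\mathbf{X}} = (1/\sqrt{N})\mathbf{Z}_N$ is distributed
according to $N[\boldsymbol{\mu}, (1/N)\boldsymbol{\Sigma}]$. We note

(9)
$$
\mathcal{E}\mathbf{Z}_\alpha = \sum_{\beta=1}^N b_{\alpha\beta}\,\mathcal{E}\mathbf{X}_\beta
= \sum_{\beta=1}^N b_{\alpha\beta}\boldsymbol{\mu}
= \sum_{\beta=1}^N b_{\alpha\beta}b_{N\beta}\sqrt{N}\boldsymbol{\mu}
= \mathbf{0}, \qquad \alpha \ne N .
$$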
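A concrete orthogonal matrix with last row (4) makes (5) through (7) easy to verify. The sketch below assumes a Helmert-type construction for $\mathbf{B}$; this is one convenient choice and not necessarily the matrix of Lemma A.4.2. The function names `helmert` and `sum_of_outer_products` are illustrative, and the first four rows of Table 3.1 serve only as sample input.

```python
from math import sqrt

def helmert(n):
    """N x N orthogonal matrix whose last row is (1/sqrt(N), ..., 1/sqrt(N))."""
    rows = []
    for k in range(1, n):
        scale = 1.0 / sqrt(k * (k + 1))
        # k equal entries, then -k times that entry, then zeros.
        rows.append([scale] * k + [-k * scale] + [0.0] * (n - k - 1))
    rows.append([1.0 / sqrt(n)] * n)
    return rows

def sum_of_outer_products(vectors):
    """Return the p x p matrix sum over alpha of v_alpha v_alpha'."""
    p = len(vectors[0])
    return [[sum(v[i] * v[j] for v in vectors) for j in range(p)] for i in range(p)]

# First four rows of Table 3.1, used only as sample input.
x = [(1.9, 0.7), (0.8, -1.6), (1.1, -0.2), (0.1, -1.2)]
N, p = len(x), len(x[0])
B = helmert(N)

# (5): Z_alpha = sum over beta of b_{alpha beta} X_beta.
z = [[sum(B[a][b] * x[b][i] for b in range(N)) for i in range(p)] for a in range(N)]

xbar = [sum(v[i] for v in x) / N for i in range(p)]
A_direct = sum_of_outer_products([[v[i] - xbar[i] for i in range(p)] for v in x])
A_rotated = sum_of_outer_products(z[:-1])   # (7): only Z_1, ..., Z_{N-1}

print(A_direct)
print(A_rotated)
print(z[-1], [sqrt(N) * m for m in xbar])   # (6): Z_N = sqrt(N) * xbar
```

The two printed matrices coincide up to rounding, and the last transformed vector equals $\sqrt{N}\,\bar{\mathbf{x}}$, as (6) and (7) state.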


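Theorem 3.3.2. The mean of a sample of size N from $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is distributed
according to $N[\boldsymbol{\mu}, (1/N)\boldsymbol{\Sigma}]$ and independently of $\hat{\boldsymbol{\Sigma}}$, the maximum likelihood
estimator of $\boldsymbol{\Sigma}$. $N\hat{\boldsymbol{\Sigma}}$ is distributed as $\sum_{\alpha=1}^{N-1}\mathbf{Z}_\alpha\mathbf{Z}_\alpha'$, where $\mathbf{Z}_\alpha$ is distributed
according to $N(\mathbf{0}, \boldsymbol{\Sigma})$, $\alpha = 1,\dots,N-1$, and $\mathbf{Z}_1,\dots,\mathbf{Z}_{N-1}$ are independent.

Definition 3.3.1. An estimator $\mathbf{t}$ of a parameter vector $\boldsymbol{\theta}$ is unbiased if and
only if $\mathcal{E}_{\boldsymbol{\theta}}\mathbf{t} = \boldsymbol{\theta}$.

Since $\mathcal{E}\bar{\mathbf{X}} = (1/N)\,\mathcal{E}\sum_{\alpha=1}^N \mathbf{X}_\alpha = \boldsymbol{\mu}$, the sample mean is an unbiased
estimator of the population mean. However,

(10)
$$
\mathcal{E}\hat{\boldsymbol{\Sigma}} = \frac{1}{N}\,\mathcal{E}\sum_{\alpha=1}^{N-1}\mathbf{Z}_\alpha\mathbf{Z}_\alpha' = \frac{N-1}{N}\boldsymbol{\Sigma} .
$$

Thus $\hat{\boldsymbol{\Sigma}}$ is a biased estimator of $\boldsymbol{\Sigma}$. We shall therefore define

(11)
$$
\mathbf{S} = \frac{1}{N-1}\mathbf{A} = \frac{1}{N-1}\sum_{\alpha=1}^N (\mathbf{x}_\alpha - \bar{\mathbf{x}})(\mathbf{x}_\alpha - \bar{\mathbf{x}})'
$$

as the sample covariance matrix. It is an unbiased estimator of $\boldsymbol{\Sigma}$ and the
diagonal elements are the usual (unbiased) sample variances of the
components of $\mathbf{X}$.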
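The bias factor $(N-1)/N$ in (10) depends only on second moments, so it can be illustrated exactly without sampling from a normal distribution. The sketch below is an assumption-laden toy: it swaps in a hypothetical univariate population taking the values 0, 1, 2 with equal probability, enumerates every sample of size 3, and averages the two estimators with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

values = [0, 1, 2]          # hypothetical population: each value has probability 1/3
N = 3                       # sample size

pop_mean = Fraction(sum(values), len(values))
pop_var = sum((Fraction(v) - pop_mean) ** 2 for v in values) / len(values)

def sum_sq_dev(sample):
    """a = sum over alpha of (x_alpha - xbar)^2, computed exactly."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((Fraction(x) - xbar) ** 2 for x in sample)

samples = list(product(values, repeat=N))      # all 27 equally likely samples
mean_a = sum(sum_sq_dev(s) for s in samples) / len(samples)

expected_mle = mean_a / N          # expectation of the divisor-N estimator
expected_s2 = mean_a / (N - 1)     # expectation of the divisor-(N - 1) estimator

print(pop_var, expected_mle, expected_s2)
```

The divisor-$(N-1)$ average equals the population variance exactly, while the divisor-$N$ average is smaller by the factor $(N-1)/N$, mirroring (10) and (11) in the scalar case.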


3.3.2. Tests and Confidence Regions for the Mean Vector When the 
Covariance Matrix Is Known 


A statistical problem of considerable importance is that of testing the
hypothesis that the mean vector of a normal distribution is a given vector,
and a related problem is that of giving a confidence region for the unknown
vector of means. We now go on to study these problems under the
assumption that the covariance matrix $\boldsymbol{\Sigma}$ is known. In Chapter 5 we consider these
problems when the covariance matrix is unknown.

In the univariate case one bases a test or a confidence interval on the fact 
that the difference between the sample mean and the population mean is 
normally distributed with mean zero and known variance; then tables of the 
normal distribution can be used to set up significance points or to compute
confidence intervals. In the multivariate case one uses the fact that the 
difference between the sample mean vector and the population mean vector 
is normally distributed with mean vector zero and known covariance matrix. 
One could set up limits for each component on the basis of the distribution, 
but this procedure has the disadvantages that the choice of limits is some- 
what arbitrary and in the case of tests leads to tests that may be very poor 
against some alternatives, and, moreover, such limits are difficult to compute 
because tables are available only for the bivariate case. The procedures given 
below, however, are easily computed and furthermore can be given general 
intuitive and theoretical justifications. 

The procedures and evaluation of their properties are based on the 
following theorem: 


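Theorem 3.3.3. If the m-component vector $\mathbf{Y}$ is distributed according to
$N(\boldsymbol{\nu}, \mathbf{T})$ (nonsingular), then $\mathbf{Y}'\mathbf{T}^{-1}\mathbf{Y}$ is distributed according to the noncentral
$\chi^2$-distribution with m degrees of freedom and noncentrality parameter $\boldsymbol{\nu}'\mathbf{T}^{-1}\boldsymbol{\nu}$.
If $\boldsymbol{\nu} = \mathbf{0}$, the distribution is the central $\chi^2$-distribution.

Proof. Let $\mathbf{C}$ be a nonsingular matrix such that $\mathbf{C}\mathbf{T}\mathbf{C}' = \mathbf{I}$, and define
$\mathbf{Z} = \mathbf{C}\mathbf{Y}$. Then $\mathbf{Z}$ is normally distributed with mean $\mathcal{E}\mathbf{Z} = \mathbf{C}\,\mathcal{E}\mathbf{Y} = \mathbf{C}\boldsymbol{\nu} = \boldsymbol{\lambda}$, say,
and covariance matrix $\mathcal{E}(\mathbf{Z} - \boldsymbol{\lambda})(\mathbf{Z} - \boldsymbol{\lambda})' = \mathcal{E}\,\mathbf{C}(\mathbf{Y} - \boldsymbol{\nu})(\mathbf{Y} - \boldsymbol{\nu})'\mathbf{C}' = \mathbf{C}\mathbf{T}\mathbf{C}' = \mathbf{I}$.
Then $\mathbf{Y}'\mathbf{T}^{-1}\mathbf{Y} = \mathbf{Z}'(\mathbf{C}')^{-1}\mathbf{T}^{-1}\mathbf{C}^{-1}\mathbf{Z} = \mathbf{Z}'(\mathbf{C}\mathbf{T}\mathbf{C}')^{-1}\mathbf{Z} = \mathbf{Z}'\mathbf{Z}$, which is the sum of
squares of the components of $\mathbf{Z}$. Similarly $\boldsymbol{\nu}'\mathbf{T}^{-1}\boldsymbol{\nu} = \boldsymbol{\lambda}'\boldsymbol{\lambda}$. Thus $\mathbf{Y}'\mathbf{T}^{-1}\mathbf{Y}$ is
distributed as $\sum_{i=1}^m Z_i^2$, where $Z_1,\dots,Z_m$ are independently normally
distributed with means $\lambda_1,\dots,\lambda_m$, respectively, and variances 1. By definition
this distribution is the noncentral $\chi^2$-distribution with noncentrality
parameter $\sum_{i=1}^m \lambda_i^2$. See Section 3.3.3. If $\lambda_1 = \cdots = \lambda_m = 0$, the distribution is central.
(See Problem 7.5.) ∎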


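Since $\sqrt{N}(\bar{\mathbf{X}} - \boldsymbol{\mu})$ is distributed according to $N(\mathbf{0}, \boldsymbol{\Sigma})$, it follows from the
theorem that

(12)
$$
N(\bar{\mathbf{X}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu})
$$

has a (central) $\chi^2$-distribution with p degrees of freedom. This is the
fundamental fact we use in setting up tests and confidence regions
concerning $\boldsymbol{\mu}$.

Let $\chi_p^2(\alpha)$ be the number such that

(13)
$$
\Pr\{\chi_p^2 > \chi_p^2(\alpha)\} = \alpha .
$$

Thus

(14)
$$
\Pr\{N(\bar{\mathbf{X}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}) > \chi_p^2(\alpha)\} = \alpha .
$$

To test the hypothesis that $\boldsymbol{\mu} = \boldsymbol{\mu}_0$, where $\boldsymbol{\mu}_0$ is a specified vector, we use as
our critical region

(15)
$$
N(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) > \chi_p^2(\alpha) .
$$

If we obtain a sample such that (15) is satisfied, we reject the null hypothesis.
It can be seen intuitively that the probability is greater than $\alpha$ of rejecting
the hypothesis if $\boldsymbol{\mu}$ is very different from $\boldsymbol{\mu}_0$, since in the space of $\bar{\mathbf{x}}$ (15)
defines an ellipsoid with center at $\boldsymbol{\mu}_0$, and when $\boldsymbol{\mu}$ is far from $\boldsymbol{\mu}_0$, the density
of $\bar{\mathbf{x}}$ will be concentrated at a point near the edge or outside of the ellipsoid.
The quantity $N(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)$ is distributed as a noncentral $\chi^2$ with
p degrees of freedom and noncentrality parameter $N(\boldsymbol{\mu} - \boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu} - \boldsymbol{\mu}_0)$
when $\bar{\mathbf{X}}$ is the mean of a sample of N from $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ [given by Bose
(1936a), (1936b)]. Pearson (1900) first proved Theorem 3.3.3 for $\boldsymbol{\nu} = \mathbf{0}$.

Now consider the following statement made on the basis of a sample with
mean $\bar{\mathbf{x}}$: "The mean of the distribution satisfies

(16)
$$
N(\bar{\mathbf{x}} - \boldsymbol{\mu}^*)'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}^*) \le \chi_p^2(\alpha)
$$

as an inequality on $\boldsymbol{\mu}^*$." We see from (14) that the probability that a sample
will be drawn such that the above statement is true is $1 - \alpha$ because the
event in (14) is equivalent to the statement being false. Thus, the set of $\boldsymbol{\mu}^*$
satisfying (16) is a confidence region for $\boldsymbol{\mu}$ with confidence $1 - \alpha$.

In the p-dimensional space of $\bar{\mathbf{x}}$, (15) is the surface and exterior of an
ellipsoid with center $\boldsymbol{\mu}_0$, the shape of the ellipsoid depending on $\boldsymbol{\Sigma}^{-1}$ and
the size on $(1/N)\chi_p^2(\alpha)$ for given $\boldsymbol{\Sigma}^{-1}$. In the p-dimensional space of $\boldsymbol{\mu}^*$,
(16) is the surface and interior of an ellipsoid with its center at $\bar{\mathbf{x}}$. If $\boldsymbol{\Sigma}^{-1} = \mathbf{I}$,
then (14) says that the probability is $\alpha$ that the distance between $\bar{\mathbf{x}}$ and $\boldsymbol{\mu}$ is
greater than $\sqrt{\chi_p^2(\alpha)/N}$.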




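Theorem 3.3.4. If $\bar{\mathbf{x}}$ is the mean of a sample of N drawn from $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and
$\boldsymbol{\Sigma}$ is known, then (15) gives a critical region of size $\alpha$ for testing the hypothesis
$\boldsymbol{\mu} = \boldsymbol{\mu}_0$, and (16) gives a confidence region for $\boldsymbol{\mu}$ of confidence $1 - \alpha$. Here
$\chi_p^2(\alpha)$ is chosen to satisfy (13).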
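The test in (15) is simple to carry out when p = 2, because the $\chi^2$-distribution with 2 degrees of freedom has upper tail $\Pr\{\chi_2^2 > c\} = e^{-c/2}$, so the significance point in (13) is $-2\log\alpha$. The sketch below is illustrative only: the "known" covariance matrix and the hypothesized mean are assumptions made up for the example, the sample mean is borrowed from the sleep data, and `quadratic_form_2x2` is a hypothetical helper name.

```python
from math import log

def quadratic_form_2x2(d, sigma):
    """Return d' sigma^{-1} d for a 2-vector d and a 2 x 2 positive definite sigma."""
    (s11, s12), (s21, s22) = sigma
    det = s11 * s22 - s12 * s21
    return (s22 * d[0] ** 2 - (s12 + s21) * d[0] * d[1] + s11 * d[1] ** 2) / det

N = 10
xbar = (2.33, 0.75)                 # sample mean from the sleep example
mu0 = (0.0, 0.0)                    # hypothesized mean (assumption for illustration)
sigma = [[4.0, 3.0], [3.0, 4.0]]    # "known" covariance matrix (assumed, not from the text)
alpha = 0.05

diff = (xbar[0] - mu0[0], xbar[1] - mu0[1])
statistic = N * quadratic_form_2x2(diff, sigma)     # left-hand side of (15)

# For p = 2, Pr{chi2 > c} = exp(-c/2), so the point defined by (13) is -2 log(alpha).
critical = -2.0 * log(alpha)

reject = statistic > critical
print(statistic, critical, reject)
```

For general p the significance point has no such closed form and would be taken from tables or a numerical routine. The confidence region (16) consists of all $\boldsymbol{\mu}^*$ for which the same quadratic form, centered at $\bar{\mathbf{x}}$, does not exceed the critical value.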


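The same technique can be used for the corresponding two-sample
problems. Suppose we have a sample $\{\mathbf{x}_\alpha^{(1)}\}$, $\alpha = 1,\dots,N_1$, from the distribution
$N(\boldsymbol{\mu}^{(1)}, \boldsymbol{\Sigma})$, and a sample $\{\mathbf{x}_\alpha^{(2)}\}$, $\alpha = 1,\dots,N_2$, from a second normal
population $N(\boldsymbol{\mu}^{(2)}, \boldsymbol{\Sigma})$ with the same covariance matrix. Then the two sample
means

(17)
$$
\bar{\mathbf{x}}^{(1)} = \frac{1}{N_1}\sum_{\alpha=1}^{N_1}\mathbf{x}_\alpha^{(1)}, \qquad
\bar{\mathbf{x}}^{(2)} = \frac{1}{N_2}\sum_{\alpha=1}^{N_2}\mathbf{x}_\alpha^{(2)}
$$

are distributed independently according to $N[\boldsymbol{\mu}^{(1)}, (1/N_1)\boldsymbol{\Sigma}]$ and
$N[\boldsymbol{\mu}^{(2)}, (1/N_2)\boldsymbol{\Sigma}]$, respectively. The difference of the two sample means,
$\mathbf{y} = \bar{\mathbf{x}}^{(1)} - \bar{\mathbf{x}}^{(2)}$, is distributed according to $N\{\boldsymbol{\nu}, [(1/N_1) + (1/N_2)]\boldsymbol{\Sigma}\}$, where
$\boldsymbol{\nu} = \boldsymbol{\mu}^{(1)} - \boldsymbol{\mu}^{(2)}$. Thus

(18)
$$
\frac{N_1N_2}{N_1+N_2}(\mathbf{y} - \boldsymbol{\nu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\nu}) \le \chi_p^2(\alpha)
$$

is a confidence region for the difference $\boldsymbol{\nu}$ of the two mean vectors, and a
critical region for testing the hypothesis $\boldsymbol{\mu}^{(1)} = \boldsymbol{\mu}^{(2)}$ is given by

(19)
$$
\frac{N_1N_2}{N_1+N_2}(\bar{\mathbf{x}}^{(1)} - \bar{\mathbf{x}}^{(2)})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}}^{(1)} - \bar{\mathbf{x}}^{(2)}) > \chi_p^2(\alpha) .
$$

Mahalanobis (1930) suggested $(\boldsymbol{\mu}^{(1)} - \boldsymbol{\mu}^{(2)})'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}^{(1)} - \boldsymbol{\mu}^{(2)})$ as a measure of
the distance squared between two populations. Let $\mathbf{C}$ be a matrix such that
$\boldsymbol{\Sigma} = \mathbf{C}\mathbf{C}'$ and let $\boldsymbol{\nu}^{(i)} = \mathbf{C}^{-1}\boldsymbol{\mu}^{(i)}$, $i = 1, 2$. Then the distance squared is
$(\boldsymbol{\nu}^{(1)} - \boldsymbol{\nu}^{(2)})'(\boldsymbol{\nu}^{(1)} - \boldsymbol{\nu}^{(2)})$, which is the Euclidean distance squared.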
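The equivalence between the Mahalanobis distance and a Euclidean distance after the transformation by $\mathbf{C}^{-1}$ can be checked in the 2 x 2 case. The sketch below assumes the lower-triangular (Cholesky) factor as one possible choice of $\mathbf{C}$ with $\boldsymbol{\Sigma} = \mathbf{C}\mathbf{C}'$; the parameter values and the helper names are hypothetical.

```python
from math import sqrt

def cholesky_2x2(sigma):
    """Lower-triangular C with sigma = C C' for a 2 x 2 positive definite matrix."""
    c11 = sqrt(sigma[0][0])
    c21 = sigma[1][0] / c11
    c22 = sqrt(sigma[1][1] - c21 ** 2)
    return [[c11, 0.0], [c21, c22]]

def solve_lower_2x2(c, m):
    """Solve C v = m by forward substitution, i.e. v = C^{-1} m."""
    v1 = m[0] / c[0][0]
    v2 = (m[1] - c[1][0] * v1) / c[1][1]
    return (v1, v2)

# Hypothetical population parameters (illustrative values only).
sigma = [[4.0, 3.0], [3.0, 4.0]]
mu1, mu2 = (2.0, 1.0), (0.0, 0.0)

# Direct form: (mu1 - mu2)' sigma^{-1} (mu1 - mu2).
d = (mu1[0] - mu2[0], mu1[1] - mu2[1])
det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
direct = (sigma[1][1] * d[0] ** 2 - 2 * sigma[0][1] * d[0] * d[1]
          + sigma[0][0] * d[1] ** 2) / det

# Transformed form: squared Euclidean distance between C^{-1} mu1 and C^{-1} mu2.
C = cholesky_2x2(sigma)
nu1, nu2 = solve_lower_2x2(C, mu1), solve_lower_2x2(C, mu2)
euclidean = (nu1[0] - nu2[0]) ** 2 + (nu1[1] - nu2[1]) ** 2

print(direct, euclidean)
```

Any other factorization $\boldsymbol{\Sigma} = \mathbf{C}\mathbf{C}'$ gives the same distance, since the quadratic form depends on $\mathbf{C}$ only through $\mathbf{C}\mathbf{C}'$.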


3.3.3. The Noncentral $\chi^2$-Distribution; the Power Function


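The power function of the test (15) of the null hypothesis that $\boldsymbol{\mu} = \boldsymbol{\mu}_0$ can be
evaluated from the noncentral $\chi^2$-distribution. The central $\chi^2$-distribution is
the distribution of the sum of squares of independent (scalar) normal
variables with means 0 and variances 1; the noncentral $\chi^2$-distribution is the
generalization of this when the means may be different from 0. Let $\mathbf{Y}$ (of p
components) be distributed according to $N(\boldsymbol{\lambda}, \mathbf{I})$. Let $\mathbf{Q}$ be an orthogonal
matrix with elements of the first row being

(20)
$$
q_{1i} = \frac{\lambda_i}{\sqrt{\boldsymbol{\lambda}'\boldsymbol{\lambda}}}, \qquad i = 1,\dots,p .
$$

Then $\mathbf{Z} = \mathbf{Q}\mathbf{Y}$ is distributed according to $N(\boldsymbol{\tau}, \mathbf{I})$, where

(21)
$$
\boldsymbol{\tau} = (\tau, 0, \dots, 0)'
$$

and $\tau = \sqrt{\boldsymbol{\lambda}'\boldsymbol{\lambda}}$. Let $V = \mathbf{Y}'\mathbf{Y} = \mathbf{Z}'\mathbf{Z} = \sum_{i=1}^p Z_i^2$. Then $W = \sum_{i=2}^p Z_i^2$ has a
$\chi^2$-distribution with $p - 1$ degrees of freedom (Problem 7.5), and $Z_1$ and W
have as joint density

(22)
$$
\begin{aligned}
&\frac{1}{\sqrt{2\pi}}\,e^{-\frac12 (z_1 - \tau)^2}\cdot
 \frac{1}{2^{\frac12(p-1)}\Gamma[\frac12(p-1)]}\,w^{\frac12(p-1)-1}e^{-\frac12 w} \\
&\qquad = C\,e^{-\frac12(\tau^2 + z_1^2 + w)}\,w^{\frac12(p-3)}\sum_{\alpha=0}^\infty \frac{\tau^\alpha z_1^\alpha}{\alpha!},
\end{aligned}
$$

where $C^{-1} = 2^{\frac12 p}\sqrt{\pi}\,\Gamma[\frac12(p-1)]$. The joint density of $V = W + Z_1^2$ and $Z_1$ is
obtained by substituting $w = v - z_1^2$ (the Jacobian being 1):

(23)
$$
C\,e^{-\frac12(\tau^2 + v)}\,(v - z_1^2)^{\frac12(p-3)}\sum_{\alpha=0}^\infty \frac{\tau^\alpha z_1^\alpha}{\alpha!} .
$$

The joint density of V and $U = Z_1/\sqrt{V}$ is $(dz_1 = \sqrt{v}\,du)$

(24)
$$
C\,e^{-\frac12(\tau^2 + v)}\,v^{\frac12 p - 1}(1 - u^2)^{\frac12(p-3)}\sum_{\alpha=0}^\infty \frac{\tau^\alpha v^{\frac12\alpha}u^\alpha}{\alpha!} .
$$

The admissible range of $z_1$ given v is $-\sqrt{v}$ to $\sqrt{v}$, and the admissible range
of u is $-1$ to 1. When we integrate (24) with respect to u term by term, the
terms for $\alpha$ odd integrate to 0, since such a term is an odd function of u. In
the other integrations we substitute $u = \sqrt{s}$ $(du = \frac12\,ds/\sqrt{s})$ to obtain

(25)
$$
\begin{aligned}
\int_{-1}^{1}(1 - u^2)^{\frac12(p-3)}u^{2\beta}\,du
&= 2\int_0^1 (1 - u^2)^{\frac12(p-3)}u^{2\beta}\,du \\
&= \int_0^1 (1 - s)^{\frac12(p-3)}s^{\beta - \frac12}\,ds \\
&= B[\tfrac12(p-1), \beta + \tfrac12] \\
&= \frac{\Gamma[\frac12(p-1)]\,\Gamma(\beta + \frac12)}{\Gamma(\frac12 p + \beta)}
\end{aligned}
$$

by the usual properties of the beta and gamma functions. Thus the density of
V is

(26)
$$
\frac{1}{2^{\frac12 p}\sqrt{\pi}}\,e^{-\frac12(\tau^2 + v)}\,v^{\frac12 p - 1}
\sum_{\beta=0}^\infty \frac{(\tau^2 v)^\beta}{(2\beta)!}\,\frac{\Gamma(\beta + \frac12)}{\Gamma(\frac12 p + \beta)} .
$$

We can use the duplication formula for the gamma function $\Gamma(2\beta + 1) = (2\beta)!$
(Problem 7.37),

(27)
$$
\Gamma(2\beta + 1) = \Gamma(\beta + \tfrac12)\,\Gamma(\beta + 1)\,2^{2\beta}/\sqrt{\pi},
$$

to rewrite (26) as

(28)
$$
\frac{1}{2^{\frac12 p}}\,e^{-\frac12(\tau^2 + v)}\,v^{\frac12 p - 1}
\sum_{\beta=0}^\infty \left(\frac{\tau^2}{4}\right)^{\!\beta}\frac{v^\beta}{\beta!\,\Gamma(\frac12 p + \beta)} .
$$

This is the density of the noncentral $\chi^2$-distribution with p degrees of
freedom and noncentrality parameter $\tau^2$.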


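Theorem 3.3.5. If $\mathbf{Y}$ of p components is distributed according to $N(\boldsymbol{\lambda}, \mathbf{I})$,
then $V = \mathbf{Y}'\mathbf{Y}$ has the density (28), where $\tau^2 = \boldsymbol{\lambda}'\boldsymbol{\lambda}$.

To obtain the power function of the test (15), we note that $\sqrt{N}(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)$
has the distribution $N[\sqrt{N}(\boldsymbol{\mu} - \boldsymbol{\mu}_0), \boldsymbol{\Sigma}]$. From Theorem 3.3.3 we obtain the
following corollary:

Corollary 3.3.1. If $\bar{\mathbf{X}}$ is the mean of a random sample of N drawn from
$N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $N(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)$ has a noncentral $\chi^2$-distribution with p
degrees of freedom and noncentrality parameter $N(\boldsymbol{\mu} - \boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu} - \boldsymbol{\mu}_0)$.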
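The series (28) can be evaluated directly, and integrating it gives the power of the test (15). The following is a rough numerical sketch, not a production routine: it sums the series term by term, integrates with a plain midpoint rule, and restricts attention to p = 2 so that the significance point has the closed form $-2\log\alpha$. The function names and the tolerance settings are assumptions made for the example.

```python
from math import exp, gamma, log

def noncentral_chi2_density(v, p, tau2, tol=1e-16, max_terms=500):
    """Series (28): density at v > 0 for p degrees of freedom, noncentrality tau2."""
    x = tau2 * v / 4.0
    term = 1.0 / gamma(p / 2.0)      # beta = 0 term of the sum
    total = term
    for beta in range(1, max_terms):
        # Ratio of consecutive terms is x / (beta * (p/2 + beta - 1)).
        term *= x / (beta * (p / 2.0 + beta - 1.0))
        total += term
        if term < tol * total:
            break
    return 2.0 ** (-p / 2.0) * exp(-(tau2 + v) / 2.0) * v ** (p / 2.0 - 1.0) * total

def power(tau2, p, critical, steps=4000):
    """Pr{V > critical}: one minus the midpoint-rule integral of the density on (0, critical)."""
    h = critical / steps
    below = sum(noncentral_chi2_density((k + 0.5) * h, p, tau2) for k in range(steps)) * h
    return 1.0 - below

alpha = 0.05
critical = -2.0 * log(alpha)    # exact upper-alpha point of chi-square with 2 d.f.
for tau2 in (0.0, 1.0, 4.0, 9.0):
    print(tau2, round(power(tau2, 2, critical), 4))
```

At $\tau^2 = 0$ the computed power equals the size $\alpha$, and it increases with the noncentrality parameter $\tau^2 = N(\boldsymbol{\mu} - \boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu} - \boldsymbol{\mu}_0)$ of Corollary 3.3.1.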




3.4. THEORETICAL PROPERTIES OF ESTIMATORS 
OF THE MEAN VECTOR 


3.4.1. Properties of Maximum Likelihood Estimators 


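It was shown in Section 3.3.1 that $\bar{\mathbf{x}}$ and $\mathbf{S}$ are unbiased estimators of $\boldsymbol{\mu}$ and
$\boldsymbol{\Sigma}$, respectively. In this subsection we shall show that $\bar{\mathbf{x}}$ and $\mathbf{S}$ are sufficient
statistics and are complete.


Sufficiency

A statistic $\mathbf{T}$ is sufficient for a family of distributions of $\mathbf{X}$ or for a parameter
$\boldsymbol{\theta}$ if the conditional distribution of $\mathbf{X}$ given $\mathbf{T} = \mathbf{t}$ does not depend on $\boldsymbol{\theta}$ [e.g.,
Cramér (1946), Section 32.4]. In this sense the statistic $\mathbf{T}$ gives as much
information about $\boldsymbol{\theta}$ as the entire sample $\mathbf{X}$. (Of course, this idea depends
strictly on the assumed family of distributions.)


Factorization Theorem. A statistic $\mathbf{t}(\mathbf{y})$ is sufficient for $\boldsymbol{\theta}$ if and only if the
density $f(\mathbf{y} \mid \boldsymbol{\theta})$ can be factored as

(1)
$$
f(\mathbf{y} \mid \boldsymbol{\theta}) = g[\mathbf{t}(\mathbf{y}), \boldsymbol{\theta}]\,h(\mathbf{y}),
$$

where $g[\mathbf{t}(\mathbf{y}), \boldsymbol{\theta}]$ and $h(\mathbf{y})$ are nonnegative and $h(\mathbf{y})$ does not depend on $\boldsymbol{\theta}$.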


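Theorem 3.4.1. If $\mathbf{x}_1,\dots,\mathbf{x}_N$ are observations from $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $\bar{\mathbf{x}}$ and $\mathbf{S}$
are sufficient for $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. If $\boldsymbol{\mu}$ is given, $\sum_{\alpha=1}^N (\mathbf{x}_\alpha - \boldsymbol{\mu})(\mathbf{x}_\alpha - \boldsymbol{\mu})'$ is sufficient for
$\boldsymbol{\Sigma}$. If $\boldsymbol{\Sigma}$ is given, $\bar{\mathbf{x}}$ is sufficient for $\boldsymbol{\mu}$.

Proof. The density of $\mathbf{X}_1,\dots,\mathbf{X}_N$ is

(2)
$$
\begin{aligned}
\prod_{\alpha=1}^N n(\mathbf{x}_\alpha \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
&= (2\pi)^{-\frac12 pN}|\boldsymbol{\Sigma}|^{-\frac12 N}
   \exp\Bigl[-\tfrac12\operatorname{tr}\boldsymbol{\Sigma}^{-1}\sum_{\alpha=1}^N (\mathbf{x}_\alpha - \boldsymbol{\mu})(\mathbf{x}_\alpha - \boldsymbol{\mu})'\Bigr] \\
&= (2\pi)^{-\frac12 pN}|\boldsymbol{\Sigma}|^{-\frac12 N}
   \exp\bigl\{-\tfrac12\bigl[N(\bar{\mathbf{x}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) + (N-1)\operatorname{tr}\boldsymbol{\Sigma}^{-1}\mathbf{S}\bigr]\bigr\} .
\end{aligned}
$$

The right-hand side of (2) is in the form of (1) for $\bar{\mathbf{x}}$, $\mathbf{S}$, $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, and the middle
is in the form of (1) for $\sum_{\alpha=1}^N (\mathbf{x}_\alpha - \boldsymbol{\mu})(\mathbf{x}_\alpha - \boldsymbol{\mu})'$, $\boldsymbol{\Sigma}$; in each case $h(\mathbf{x}_1,\dots,\mathbf{x}_N)
= 1$. The right-hand side is in the form of (1) for $\bar{\mathbf{x}}$, $\boldsymbol{\mu}$ with $h(\mathbf{x}_1,\dots,\mathbf{x}_N) =
\exp\{-\tfrac12(N-1)\operatorname{tr}\boldsymbol{\Sigma}^{-1}\mathbf{S}\}$. ∎

Note that if $\boldsymbol{\Sigma}$ is given, $\bar{\mathbf{x}}$ is sufficient for $\boldsymbol{\mu}$, but if $\boldsymbol{\mu}$ is given, $\mathbf{S}$ is not
sufficient for $\boldsymbol{\Sigma}$.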
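One practical reading of the factorization (2) is that two samples sharing the same $\bar{\mathbf{x}}$ and $\mathbf{S}$ have the same likelihood at every parameter value. The sketch below illustrates this in the univariate case (p = 1) with a made-up sample and its reflection about the sample mean; the function name `normal_log_likelihood` and the parameter values tried are illustrative assumptions.

```python
from math import log, pi

def normal_log_likelihood(sample, mu, sigma2):
    """Log of the joint N(mu, sigma2) density of a univariate sample (p = 1 case of (2))."""
    n = len(sample)
    ss = sum((x - mu) ** 2 for x in sample)
    return -0.5 * n * log(2 * pi * sigma2) - 0.5 * ss / sigma2

sample_a = [0.0, 1.0, 5.0]
xbar = sum(sample_a) / len(sample_a)
# Reflecting each point about the mean keeps xbar and the sum of squared deviations.
sample_b = [2 * xbar - x for x in sample_a]

for mu, sigma2 in [(0.0, 1.0), (2.0, 3.5), (-1.0, 0.25)]:
    la = normal_log_likelihood(sample_a, mu, sigma2)
    lb = normal_log_likelihood(sample_b, mu, sigma2)
    print(mu, sigma2, la, lb)
```

The two log-likelihoods agree for every pair of parameter values tried, even though the samples themselves differ.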


Completeness 
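To prove an optimality property of the $T^2$-test (Section 5.5), we need the
result that $(\bar{\mathbf{x}}, \mathbf{S})$ is a complete sufficient set of statistics for $(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.

Definition 3.4.1. A family of distributions of $\mathbf{y}$ indexed by $\boldsymbol{\theta}$ is complete if
for every real-valued function $g(\mathbf{y})$,

(3)
$$
\mathcal{E}_{\boldsymbol{\theta}}\,g(\mathbf{y}) \equiv 0
$$

identically in $\boldsymbol{\theta}$ implies $g(\mathbf{y}) = 0$ except for a set of $\mathbf{y}$ of probability 0 for every $\boldsymbol{\theta}$.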


If the family of distributions of a sufficient set of statistics is complete, the 
set is called a complete sufficient set. 


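Theorem 3.4.2. The sufficient set of statistics $\bar{\mathbf{x}}$, $\mathbf{S}$ is complete for $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$ when
the sample is drawn from $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.

Proof. We can define the sample in terms of $\bar{\mathbf{x}}$ and $\mathbf{z}_1,\dots,\mathbf{z}_n$ as in Section
3.3 with $n = N - 1$. We assume for any function $g(\bar{\mathbf{x}}, \mathbf{A}) = g(\bar{\mathbf{x}}, n\mathbf{S})$ that

(4)
$$
\int\!\cdots\!\int K|\boldsymbol{\Sigma}|^{-\frac12 N}\,
g\Bigl(\bar{\mathbf{x}}, \sum_{\alpha=1}^n \mathbf{z}_\alpha\mathbf{z}_\alpha'\Bigr)
\exp\Bigl\{-\tfrac12\Bigl[N(\bar{\mathbf{x}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) + \sum_{\alpha=1}^n \mathbf{z}_\alpha'\boldsymbol{\Sigma}^{-1}\mathbf{z}_\alpha\Bigr]\Bigr\}
\,d\bar{\mathbf{x}}\prod_{\alpha=1}^n d\mathbf{z}_\alpha \equiv 0, \qquad \forall\,\boldsymbol{\mu}, \boldsymbol{\Sigma},
$$

where $K = \sqrt{N}(2\pi)^{-\frac12 pN}$, $d\bar{\mathbf{x}} = \prod_{i=1}^p d\bar x_i$, and $d\mathbf{z}_\alpha = \prod_{i=1}^p dz_{i\alpha}$. If we let $\boldsymbol{\Sigma}^{-1}
= \mathbf{I} - 2\boldsymbol{\Theta}$, where $\boldsymbol{\Theta} = \boldsymbol{\Theta}'$ and $\mathbf{I} - 2\boldsymbol{\Theta}$ is positive definite, and let $\boldsymbol{\mu} =
(\mathbf{I} - 2\boldsymbol{\Theta})^{-1}\mathbf{t}$, then (4) is

(5)
$$
\begin{aligned}
0 &\equiv \int\!\cdots\!\int K|\mathbf{I} - 2\boldsymbol{\Theta}|^{\frac12 N}\,
g\Bigl(\bar{\mathbf{x}}, \sum_{\alpha=1}^n \mathbf{z}_\alpha\mathbf{z}_\alpha'\Bigr) \\
&\qquad\cdot\exp\Bigl\{-\tfrac12\Bigl[\operatorname{tr}(\mathbf{I} - 2\boldsymbol{\Theta})\Bigl(\sum_{\alpha=1}^n \mathbf{z}_\alpha\mathbf{z}_\alpha' + N\bar{\mathbf{x}}\bar{\mathbf{x}}'\Bigr)
 - 2N\mathbf{t}'\bar{\mathbf{x}} + N\mathbf{t}'(\mathbf{I} - 2\boldsymbol{\Theta})^{-1}\mathbf{t}\Bigr]\Bigr\}\,d\bar{\mathbf{x}}\prod_{\alpha=1}^n d\mathbf{z}_\alpha \\
&= |\mathbf{I} - 2\boldsymbol{\Theta}|^{\frac12 N}\exp\{-\tfrac12 N\mathbf{t}'(\mathbf{I} - 2\boldsymbol{\Theta})^{-1}\mathbf{t}\}
 \int\!\cdots\!\int g(\bar{\mathbf{x}}, \mathbf{B} - N\bar{\mathbf{x}}\bar{\mathbf{x}}') \\
&\qquad\cdot\exp[\operatorname{tr}\boldsymbol{\Theta}\mathbf{B} + \mathbf{t}'(N\bar{\mathbf{x}})]\;
 n[\bar{\mathbf{x}} \mid \mathbf{0}, (1/N)\mathbf{I}]\prod_{\alpha=1}^n n(\mathbf{z}_\alpha \mid \mathbf{0}, \mathbf{I})\,d\bar{\mathbf{x}}\prod_{\alpha=1}^n d\mathbf{z}_\alpha,
\end{aligned}
$$

where $\mathbf{B} = \sum_{\alpha=1}^n \mathbf{z}_\alpha\mathbf{z}_\alpha' + N\bar{\mathbf{x}}\bar{\mathbf{x}}'$. Thus

(6)
$$
\begin{aligned}
0 &\equiv \mathcal{E}\,g(\bar{\mathbf{x}}, \mathbf{B} - N\bar{\mathbf{x}}\bar{\mathbf{x}}')\exp[\operatorname{tr}\boldsymbol{\Theta}\mathbf{B} + \mathbf{t}'(N\bar{\mathbf{x}})] \\
&= \int\!\cdots\!\int g(\bar{\mathbf{x}}, \mathbf{B} - N\bar{\mathbf{x}}\bar{\mathbf{x}}')\exp[\operatorname{tr}\boldsymbol{\Theta}\mathbf{B} + \mathbf{t}'(N\bar{\mathbf{x}})]\,h(\bar{\mathbf{x}}, \mathbf{B})\,d\bar{\mathbf{x}}\,d\mathbf{B},
\end{aligned}
$$

where $h(\bar{\mathbf{x}}, \mathbf{B})$ is the joint density of $\bar{\mathbf{x}}$ and $\mathbf{B}$ and $d\mathbf{B} = \prod_{i \le j} db_{ij}$. The
right-hand side of (6) is the Laplace transform of $g(\bar{\mathbf{x}}, \mathbf{B} - N\bar{\mathbf{x}}\bar{\mathbf{x}}')h(\bar{\mathbf{x}}, \mathbf{B})$.
Since this is 0, $g(\bar{\mathbf{x}}, \mathbf{A}) = 0$ except for a set of measure 0. ∎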


Efficiency 
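If a q-component random vector $\mathbf{Y}$ has mean vector $\mathcal{E}\mathbf{Y} = \boldsymbol{\nu}$ and covariance
matrix $\mathcal{E}(\mathbf{Y} - \boldsymbol{\nu})(\mathbf{Y} - \boldsymbol{\nu})' = \boldsymbol{\Psi}$, then

(7)
$$
(\mathbf{y} - \boldsymbol{\nu})'\boldsymbol{\Psi}^{-1}(\mathbf{y} - \boldsymbol{\nu}) = q + 2
$$

is called the concentration ellipsoid of $\mathbf{Y}$. [See Cramér (1946), p. 300.] The
density defined by a uniform distribution over the interior of this ellipsoid
has the same mean vector and covariance matrix as $\mathbf{Y}$. (See Problem 2.14.)
Let $\boldsymbol{\theta}$ be a vector of q parameters in a distribution, and let $\mathbf{t}$ be a vector of
unbiased estimators (that is, $\mathcal{E}\mathbf{t} = \boldsymbol{\theta}$) based on N observations from that
distribution with covariance matrix $\boldsymbol{\Psi}$. Then the ellipsoid

(8)
$$
N(\mathbf{t} - \boldsymbol{\theta})'\,\mathcal{E}\left[\left(\frac{\partial\log f}{\partial\boldsymbol{\theta}}\right)\left(\frac{\partial\log f}{\partial\boldsymbol{\theta}}\right)'\right](\mathbf{t} - \boldsymbol{\theta}) = q + 2
$$

lies entirely within the ellipsoid of concentration of $\mathbf{t}$; $\partial\log f/\partial\boldsymbol{\theta}$ denotes the
column vector of derivatives of the logarithm of the density of the distribution
(or probability function) with respect to the components of $\boldsymbol{\theta}$. The discussion by Cramér
(1946, p. 495) is in terms of scalar observations, but it is clear that it holds
true for vector observations. If (8) is the ellipsoid of concentration of $\mathbf{t}$, then
$\mathbf{t}$ is said to be efficient. In general, the ratio of the volume of (8) to that of
the ellipsoid of concentration defines the efficiency of $\mathbf{t}$. In the case of the
multivariate normal distribution, if $\boldsymbol{\theta} = \boldsymbol{\mu}$, then $\bar{\mathbf{x}}$ is efficient. If $\boldsymbol{\theta}$ includes
both $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, then $\bar{\mathbf{x}}$ and $\mathbf{S}$ have efficiency $[(N-1)/N]^{p(p+1)/2}$. Under
suitable regularity conditions, which are satisfied by the multivariate normal
distribution,

(9)
$$
\mathcal{E}\left[\left(\frac{\partial\log f}{\partial\boldsymbol{\theta}}\right)\left(\frac{\partial\log f}{\partial\boldsymbol{\theta}}\right)'\right]
= -\mathcal{E}\,\frac{\partial^2\log f}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'} .
$$

This is the information matrix for one observation. The Cramér-Rao lower
bound is that for any unbiased estimator $\mathbf{t}$ the matrix

(10)
$$
N\,\mathcal{E}(\mathbf{t} - \boldsymbol{\theta})(\mathbf{t} - \boldsymbol{\theta})' - \left[-\mathcal{E}\,\frac{\partial^2\log f}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'}\right]^{-1}
$$

is positive semidefinite. (Other lower bounds can also be given.)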


Consistency 


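Definition 3.4.2. A sequence of vectors $\mathbf{t}_n = (t_{1n},\dots,t_{mn})'$, $n = 1, 2, \dots$, is
a consistent estimator of $\boldsymbol{\theta} = (\theta_1,\dots,\theta_m)'$ if $\operatorname{plim}_{n\to\infty} t_{in} = \theta_i$, $i = 1,\dots,m$.

By the law of large numbers each component of the sample mean $\bar{\mathbf{x}}$ is a
consistent estimator of that component of the vector of expected values $\boldsymbol{\mu}$ if
the observation vectors are independently and identically distributed with
mean $\boldsymbol{\mu}$, and hence $\bar{\mathbf{x}}$ is a consistent estimator of $\boldsymbol{\mu}$. Normality is not
involved.

An element of the sample covariance matrix is

(11)
$$
s_{ij} = \frac{1}{N-1}\sum_{\alpha=1}^N (x_{i\alpha} - \mu_i)(x_{j\alpha} - \mu_j) - \frac{N}{N-1}(\bar x_i - \mu_i)(\bar x_j - \mu_j)
$$

by Lemma 3.2.1 with $\mathbf{b} = \boldsymbol{\mu}$. The probability limit of the second term is 0.
The probability limit of the first term is $\sigma_{ij}$ if $\mathbf{x}_1, \mathbf{x}_2, \dots$ are independently
and identically distributed with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Then $\mathbf{S}$ is
a consistent estimator of $\boldsymbol{\Sigma}$.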


Asymptotic Normality 
First we prove a multivariate central limit theorem. 


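Theorem 3.4.3. Let the m-component vectors $\mathbf{Y}_1, \mathbf{Y}_2, \dots$ be independently
and identically distributed with means $\mathcal{E}\mathbf{Y}_\alpha = \boldsymbol{\nu}$ and covariance matrices
$\mathcal{E}(\mathbf{Y}_\alpha - \boldsymbol{\nu})(\mathbf{Y}_\alpha - \boldsymbol{\nu})' = \mathbf{T}$. Then the limiting distribution of $(1/\sqrt{n})\sum_{\alpha=1}^n (\mathbf{Y}_\alpha - \boldsymbol{\nu})$
as $n \to \infty$ is $N(\mathbf{0}, \mathbf{T})$.

Proof. Let

(12)
$$
\phi_n(\mathbf{t}, u) = \mathcal{E}\exp\Bigl[iu\,\mathbf{t}'\frac{1}{\sqrt{n}}\sum_{\alpha=1}^n (\mathbf{Y}_\alpha - \boldsymbol{\nu})\Bigr],
$$

where u is a scalar and $\mathbf{t}$ an m-component vector. For fixed $\mathbf{t}$, $\phi_n(\mathbf{t}, u)$ can be
considered as the characteristic function of $(1/\sqrt{n})\sum_{\alpha=1}^n (\mathbf{t}'\mathbf{Y}_\alpha - \mathcal{E}\mathbf{t}'\mathbf{Y}_\alpha)$. By
the univariate central limit theorem [Cramér (1946), p. 215], the limiting
distribution is $N(0, \mathbf{t}'\mathbf{T}\mathbf{t})$. Therefore (Theorem 2.6.4),

(13)
$$
\lim_{n\to\infty}\phi_n(\mathbf{t}, u) = e^{-\frac12 u^2\mathbf{t}'\mathbf{T}\mathbf{t}}
$$

for every u and $\mathbf{t}$. (For $\mathbf{t} = \mathbf{0}$ a special and obvious argument is used.) Let
$u = 1$ to obtain

(14)
$$
\lim_{n\to\infty}\mathcal{E}\exp\Bigl[i\,\mathbf{t}'\frac{1}{\sqrt{n}}\sum_{\alpha=1}^n (\mathbf{Y}_\alpha - \boldsymbol{\nu})\Bigr] = e^{-\frac12\mathbf{t}'\mathbf{T}\mathbf{t}}
$$

for every $\mathbf{t}$. Since $e^{-\frac12\mathbf{t}'\mathbf{T}\mathbf{t}}$ is continuous at $\mathbf{t} = \mathbf{0}$, the convergence is uniform in
some neighborhood of $\mathbf{t} = \mathbf{0}$. The theorem follows. ∎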


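Now we wish to show that the sample covariance matrix is asymptotically
normally distributed as the sample size increases.

Theorem 3.4.4. Let $\mathbf{A}(n) = \sum_{\alpha=1}^N (\mathbf{X}_\alpha - \bar{\mathbf{X}}_N)(\mathbf{X}_\alpha - \bar{\mathbf{X}}_N)'$, where $\mathbf{X}_1, \mathbf{X}_2, \dots$
are independently distributed according to $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and $n = N - 1$. Then the
limiting distribution of $\mathbf{B}(n) = (1/\sqrt{n})[\mathbf{A}(n) - n\boldsymbol{\Sigma}]$ is normal with mean $\mathbf{0}$ and
covariances

(15)
$$
\mathcal{E}\,b_{ij}(n)\,b_{kl}(n) = \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk} .
$$

Proof. As shown earlier, $\mathbf{A}(n)$ is distributed as $\mathbf{A}(n) = \sum_{\alpha=1}^n \mathbf{Z}_\alpha\mathbf{Z}_\alpha'$, where
$\mathbf{Z}_1, \mathbf{Z}_2, \dots$ are distributed independently according to $N(\mathbf{0}, \boldsymbol{\Sigma})$. We arrange
the elements of $\mathbf{Z}_\alpha\mathbf{Z}_\alpha'$ in a vector such as

(16)
$$
\mathbf{Y}_\alpha = \begin{pmatrix} Z_{1\alpha}^2 \\ Z_{1\alpha}Z_{2\alpha} \\ \vdots \\ Z_{2\alpha}^2 \\ \vdots \\ Z_{p\alpha}^2 \end{pmatrix};
$$

the moments of $\mathbf{Y}_\alpha$ can be deduced from the moments of $\mathbf{Z}_\alpha$ as given in
Section 2.6. We have $\mathcal{E}Z_{i\alpha}Z_{j\alpha} = \sigma_{ij}$, $\mathcal{E}Z_{i\alpha}Z_{j\alpha}Z_{k\alpha}Z_{l\alpha} = \sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} +
\sigma_{il}\sigma_{jk}$, $\mathcal{E}(Z_{i\alpha}Z_{j\alpha} - \sigma_{ij})(Z_{k\alpha}Z_{l\alpha} - \sigma_{kl}) = \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}$. Thus the vectors $\mathbf{Y}_\alpha$
defined by (16) satisfy the conditions of Theorem 3.4.3 with the elements
of $\boldsymbol{\nu}$ being the elements of $\boldsymbol{\Sigma}$ arranged in vector form similar to (16)
and the elements of $\mathbf{T}$ being given above. If the elements of $\mathbf{A}(n)$ are
arranged in vector form similar to (16), say the vector $\mathbf{W}(n)$, then $\mathbf{W}(n) - n\boldsymbol{\nu}
= \sum_{\alpha=1}^n (\mathbf{Y}_\alpha - \boldsymbol{\nu})$. By Theorem 3.4.3, $(1/\sqrt{n})[\mathbf{W}(n) - n\boldsymbol{\nu}]$ has a limiting normal
distribution with mean $\mathbf{0}$ and the covariance matrix of $\mathbf{Y}_\alpha$. ∎

The elements of $\mathbf{B}(n)$ will have a limiting normal distribution with mean 0
if $\mathbf{x}_1, \mathbf{x}_2, \dots$ are independently and identically distributed with finite
fourth-order moments, but the covariance structure of $\mathbf{B}(n)$ will depend on the
fourth-order moments.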


3.4.2. Decision Theory 


It may be enlightening to consider estimation in terms of decision theory. We
review some of the concepts. An observation x is made on a random variable
X (which may be a vector) whose distribution $P_\theta$ depends on a parameter $\theta$
which is an element of a set $\Theta$. The statistician is to make a decision d in a
set D. A decision procedure is a function $\delta(x)$ whose domain is the set of
values of X and whose range is D. The loss in making decision d when the
distribution is $P_\theta$ is a nonnegative function $L(\theta, d)$. The evaluation of a
procedure $\delta(x)$ is on the basis of the risk function

(17)
$$
R(\theta, \delta) = \mathcal{E}_\theta L[\theta, \delta(X)] .
$$

For example, if d and $\theta$ are univariate, the loss may be squared error,
$L(\theta, d) = (\theta - d)^2$, and the risk is the mean squared error $\mathcal{E}_\theta[\delta(X) - \theta]^2$.
A decision procedure $\delta(x)$ is as good as a procedure $\delta^*(x)$ if

(18)
$$
R(\theta, \delta) \le R(\theta, \delta^*), \qquad \forall\,\theta;
$$

$\delta(x)$ is better than $\delta^*(x)$ if (18) holds with a strict inequality for at least one
value of $\theta$. A procedure $\delta^*(x)$ is inadmissible if there exists another
procedure $\delta(x)$ that is better than $\delta^*(x)$. A procedure is admissible if it is not
inadmissible (i.e., if there is no procedure better than it) in terms of the given
loss function. A class of procedures is complete if for any procedure not in
the class there is a better procedure in the class. The class is minimal
complete if it does not contain a proper complete subclass. If a minimal
complete class exists, it is identical to the class of admissible procedures.
When such a class is available, there is no (mathematical) need to use a
procedure outside the minimal complete class. Sometimes it is convenient to
refer to an essentially complete class, which is a class of procedures such that
for every procedure outside the class there is one in the class that is just as
good.




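For a given procedure the risk function is a function of the parameter. If
the parameter can be assigned an a priori distribution, say, with density $\rho(\theta)$,
then the average loss from use of a decision procedure $\delta(x)$ is

(19)
$$
r(\rho, \delta) = \mathcal{E}_\rho R(\theta, \delta) = \mathcal{E}_\rho\,\mathcal{E}_\theta L[\theta, \delta(X)] .
$$

Given the a priori density $\rho$, the decision procedure $\delta(x)$ that minimizes
$r(\rho, \delta)$ is the Bayes procedure, and the resulting minimum of $r(\rho, \delta)$ is the
Bayes risk. Under general conditions Bayes procedures are admissible and
admissible procedures are Bayes or limits of Bayes procedures. If the density
of X given $\theta$ is $f(x \mid \theta)$, the joint density of X and $\theta$ is $f(x \mid \theta)\rho(\theta)$ and the
average risk of a procedure $\delta(x)$ is

(20)
$$
\begin{aligned}
r(\rho, \delta) &= \int_\Theta\int_X L[\theta, \delta(x)]\,f(x \mid \theta)\,\rho(\theta)\,dx\,d\theta \\
&= \int_X\Bigl\{\int_\Theta L[\theta, \delta(x)]\,g(\theta \mid x)\,d\theta\Bigr\}f(x)\,dx;
\end{aligned}
$$

here

(21)
$$
f(x) = \int_\Theta f(x \mid \theta)\rho(\theta)\,d\theta, \qquad g(\theta \mid x) = \frac{f(x \mid \theta)\rho(\theta)}{f(x)}
$$

are the marginal density of X and the a posteriori density of $\theta$ given x. The
procedure that minimizes $r(\rho, \delta)$ is one that for each x minimizes the
expression in braces on the right-hand side of (20), that is, the expectation of
$L[\theta, \delta(x)]$ with respect to the a posteriori distribution. If $\theta$ and d are vectors
($\boldsymbol{\theta}$ and $\mathbf{d}$) and $L(\boldsymbol{\theta}, \mathbf{d}) = (\boldsymbol{\theta} - \mathbf{d})'\mathbf{Q}(\boldsymbol{\theta} - \mathbf{d})$, where $\mathbf{Q}$ is positive definite, then

(22)
$$
\begin{aligned}
\mathcal{E}_{\boldsymbol{\theta}\mid\mathbf{x}}L[\boldsymbol{\theta}, \mathbf{d}(\mathbf{x})]
&= \mathcal{E}_{\boldsymbol{\theta}\mid\mathbf{x}}[\boldsymbol{\theta} - \mathcal{E}(\boldsymbol{\theta}\mid\mathbf{x})]'\mathbf{Q}[\boldsymbol{\theta} - \mathcal{E}(\boldsymbol{\theta}\mid\mathbf{x})] \\
&\quad + [\mathcal{E}(\boldsymbol{\theta}\mid\mathbf{x}) - \mathbf{d}(\mathbf{x})]'\mathbf{Q}[\mathcal{E}(\boldsymbol{\theta}\mid\mathbf{x}) - \mathbf{d}(\mathbf{x})] .
\end{aligned}
$$

The minimum occurs at $\mathbf{d}(\mathbf{x}) = \mathcal{E}(\boldsymbol{\theta}\mid\mathbf{x})$, the mean of the a posteriori
distribution.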


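Theorem 3.4.5. If $\mathbf{x}_1,\dots,\mathbf{x}_N$ are independently distributed, each $\mathbf{x}_\alpha$
according to $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, and if $\boldsymbol{\mu}$ has an a priori distribution $N(\boldsymbol{\nu}, \boldsymbol{\Phi})$, then the a
posteriori distribution of $\boldsymbol{\mu}$ given $\mathbf{x}_1,\dots,\mathbf{x}_N$ is normal with mean

(23)
$$
\boldsymbol{\Phi}\Bigl(\boldsymbol{\Phi} + \frac{1}{N}\boldsymbol{\Sigma}\Bigr)^{-1}\bar{\mathbf{x}} + \frac{1}{N}\boldsymbol{\Sigma}\Bigl(\boldsymbol{\Phi} + \frac{1}{N}\boldsymbol{\Sigma}\Bigr)^{-1}\boldsymbol{\nu}
$$

and covariance matrix

(24)
$$
\boldsymbol{\Phi} - \boldsymbol{\Phi}\Bigl(\boldsymbol{\Phi} + \frac{1}{N}\boldsymbol{\Sigma}\Bigr)^{-1}\boldsymbol{\Phi} .
$$

Proof. Since $\bar{\mathbf{x}}$ is sufficient for $\boldsymbol{\mu}$, we need only consider $\bar{\mathbf{x}}$, which has the
distribution of $\boldsymbol{\mu} + \mathbf{v}$, where $\mathbf{v}$ has the distribution $N[\mathbf{0}, (1/N)\boldsymbol{\Sigma}]$ and is
independent of $\boldsymbol{\mu}$. Then the joint distribution of $\boldsymbol{\mu}$ and $\bar{\mathbf{x}}$ is

(25)
$$
N\left[\begin{pmatrix}\boldsymbol{\nu} \\ \boldsymbol{\nu}\end{pmatrix},
\begin{pmatrix}\boldsymbol{\Phi} & \boldsymbol{\Phi} \\ \boldsymbol{\Phi} & \boldsymbol{\Phi} + \frac{1}{N}\boldsymbol{\Sigma}\end{pmatrix}\right] .
$$

The mean of the conditional distribution of $\boldsymbol{\mu}$ given $\bar{\mathbf{x}}$ is (by Theorem 2.5.1)

(26)
$$
\boldsymbol{\nu} + \boldsymbol{\Phi}\Bigl(\boldsymbol{\Phi} + \frac{1}{N}\boldsymbol{\Sigma}\Bigr)^{-1}(\bar{\mathbf{x}} - \boldsymbol{\nu}),
$$

which reduces to (23). ∎

Corollary 3.4.1. If $\mathbf{x}_1,\dots,\mathbf{x}_N$ are independently distributed, each $\mathbf{x}_\alpha$
according to $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, $\boldsymbol{\mu}$ has an a priori distribution $N(\boldsymbol{\nu}, \boldsymbol{\Phi})$, and the loss function
is $(\mathbf{d} - \boldsymbol{\mu})'\mathbf{Q}(\mathbf{d} - \boldsymbol{\mu})$, then the Bayes estimator of $\boldsymbol{\mu}$ is (23).

The Bayes estimator of $\boldsymbol{\mu}$ is a kind of weighted average of $\bar{\mathbf{x}}$ and $\boldsymbol{\nu}$, the
prior mean of $\boldsymbol{\mu}$. If $(1/N)\boldsymbol{\Sigma}$ is small compared to $\boldsymbol{\Phi}$ (e.g., if N is large), $\boldsymbol{\nu}$ is
given little weight. Put another way, if $\boldsymbol{\Phi}$ is large, that is, the prior is
relatively uninformative, a large weight is put on $\bar{\mathbf{x}}$. In fact, as $\boldsymbol{\Phi}$ tends to $\infty$
in the sense that $\boldsymbol{\Phi}^{-1} \to \mathbf{0}$, the estimator approaches $\bar{\mathbf{x}}$.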
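The weighting in (23) is easiest to see in the scalar case p = 1, where $\boldsymbol{\Phi}$ and $\boldsymbol{\Sigma}$ reduce to variances. The sketch below is illustrative: the function name `posterior_normal_mean` and all numerical inputs are assumptions, and the comparison with the familiar precision form (posterior precision equals prior precision plus $N/\sigma^2$) is included only as a consistency check on (24).

```python
def posterior_normal_mean(xbar, n, sigma2, nu, phi):
    """Scalar (p = 1) case of (23) and (24): posterior mean and variance of mu."""
    m = sigma2 / n                       # variance of xbar given mu, (1/N) Sigma
    weight_data = phi / (phi + m)        # Phi (Phi + Sigma/N)^{-1}
    weight_prior = m / (phi + m)         # (Sigma/N) (Phi + Sigma/N)^{-1}
    mean = weight_data * xbar + weight_prior * nu
    variance = phi - phi * phi / (phi + m)
    return mean, variance

# Illustrative numbers (assumptions, not from the text).
xbar, n, sigma2, nu = 2.33, 10, 4.0, 0.0
for phi in (0.1, 1.0, 100.0):
    mean, variance = posterior_normal_mean(xbar, n, sigma2, nu, phi)
    # Equivalent precision form: 1/variance = 1/phi + n/sigma2.
    precision_form = 1.0 / (1.0 / phi + n / sigma2)
    print(phi, mean, variance, precision_form)
```

As the prior variance grows, the weight on the sample mean tends to 1 and the posterior mean approaches $\bar x$, in line with the remark following Corollary 3.4.1.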

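A decision procedure $\delta_0(x)$ is minimax if

(27)
$$
\sup_\theta R(\theta, \delta_0) = \inf_\delta\sup_\theta R(\theta, \delta) .
$$

Theorem 3.4.6. If $\mathbf{x}_1,\dots,\mathbf{x}_N$ are independently distributed each according
to $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and the loss function is $(\mathbf{d} - \boldsymbol{\mu})'\mathbf{Q}(\mathbf{d} - \boldsymbol{\mu})$, then $\bar{\mathbf{x}}$ is a minimax
estimator.

Proof. This follows from a theorem in statistical decision theory that if a
procedure $\delta_0$ is extended Bayes [i.e., if for arbitrary $\varepsilon$, $r(\rho, \delta_0) \le r(\rho, \delta_\rho) + \varepsilon$
for suitable $\rho$, where $\delta_\rho$ is the corresponding Bayes procedure] and if
$R(\theta, \delta_0)$ is constant, then $\delta_0$ is minimax. [See, e.g., Ferguson (1967),
Theorem 3 of Section 2.11.] We find

(28)
$$
\begin{aligned}
R(\boldsymbol{\mu}, \bar{\mathbf{x}}) &= \mathcal{E}(\bar{\mathbf{x}} - \boldsymbol{\mu})'\mathbf{Q}(\bar{\mathbf{x}} - \boldsymbol{\mu}) \\
&= \mathcal{E}\operatorname{tr}\mathbf{Q}(\bar{\mathbf{x}} - \boldsymbol{\mu})(\bar{\mathbf{x}} - \boldsymbol{\mu})' \\
&= \frac{1}{N}\operatorname{tr}\mathbf{Q}\boldsymbol{\Sigma} .
\end{aligned}
$$

Let (23) be $\mathbf{d}(\bar{\mathbf{x}})$. Its average risk is

(29)
$$
\begin{aligned}
\mathcal{E}_{\bar{\mathbf{x}}}\,\mathcal{E}_{\boldsymbol{\mu}}\{\operatorname{tr}\mathbf{Q}[\mathbf{d}(\bar{\mathbf{x}}) - \boldsymbol{\mu}][\mathbf{d}(\bar{\mathbf{x}}) - \boldsymbol{\mu}]' \mid \bar{\mathbf{x}}\}
&= \mathcal{E}_{\bar{\mathbf{x}}}\operatorname{tr}\mathbf{Q}\Bigl[\boldsymbol{\Phi} - \boldsymbol{\Phi}\Bigl(\boldsymbol{\Phi} + \frac{1}{N}\boldsymbol{\Sigma}\Bigr)^{-1}\boldsymbol{\Phi}\Bigr] \\
&= \operatorname{tr}\mathbf{Q}\boldsymbol{\Phi}\Bigl(\boldsymbol{\Phi} + \frac{1}{N}\boldsymbol{\Sigma}\Bigr)^{-1}\frac{1}{N}\boldsymbol{\Sigma} \\
&= \operatorname{tr}\mathbf{Q}\Bigl(\mathbf{I} + \frac{1}{N}\boldsymbol{\Sigma}\boldsymbol{\Phi}^{-1}\Bigr)^{-1}\frac{1}{N}\boldsymbol{\Sigma}
 \;\to\; \frac{1}{N}\operatorname{tr}\mathbf{Q}\boldsymbol{\Sigma}
\end{aligned}
$$

as $\boldsymbol{\Phi}^{-1} \to \mathbf{0}$. ∎

For more discussion of decision theory see Ferguson (1967), DeGroot
(1970), or Berger (1980b).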


3.5. IMPROVED ESTIMATION OF THE MEAN 


3.5.1. Introduction 


The sample mean $\bar{\mathbf{x}}$ seems the natural estimator of the population mean $\boldsymbol{\mu}$
based on a sample from $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. It is the maximum likelihood estimator, a
sufficient statistic when $\boldsymbol{\Sigma}$ is known, and the minimum variance unbiased
estimator. Moreover, it is equivariant in the sense that if an arbitrary vector $\mathbf{v}$
is added to each observation vector and to $\boldsymbol{\mu}$, the error of estimation
$(\bar{\mathbf{x}} + \mathbf{v}) - (\boldsymbol{\mu} + \mathbf{v}) = \bar{\mathbf{x}} - \boldsymbol{\mu}$ is independent of $\mathbf{v}$; in other words, the error
does not depend on the choice of origin. However, Stein (1956b) showed the
startling fact that this conventional estimator is not admissible with respect to
the loss function that is the sum of mean squared errors of the components
when $\boldsymbol{\Sigma} = \mathbf{I}$ and $p \ge 3$. James and Stein (1961) produced an estimator which
has a smaller sum of mean squared errors; this estimator will be studied in
Section 3.5.2. Subsequent studies have shown that the phenomenon is
widespread and the implications imperative.


3.5.2. The James-Stein Estimator 


The loss function

(1)
$$
L(\boldsymbol{\mu}, \mathbf{m}) = (\mathbf{m} - \boldsymbol{\mu})'(\mathbf{m} - \boldsymbol{\mu}) = \sum_{i=1}^p (m_i - \mu_i)^2 = \|\mathbf{m} - \boldsymbol{\mu}\|^2
$$

is the sum of mean squared errors of the components of the estimator. We
shall show [James and Stein (1961)] that the sample mean is inadmissible by
displaying an alternative estimator that has a smaller expected loss for every
mean vector $\boldsymbol{\mu}$. We assume that the normal distribution sampled has
covariance matrix proportional to $\mathbf{I}$ with the constant of proportionality known. It
will be convenient to take this constant to be such that $\mathbf{Y} = (1/N)\sum_{\alpha=1}^N \mathbf{X}_\alpha = \bar{\mathbf{X}}$
has the distribution $N(\boldsymbol{\mu}, \mathbf{I})$. Then the expected loss or risk of the estimator
$\mathbf{Y}$ is simply $\mathcal{E}\|\mathbf{Y} - \boldsymbol{\mu}\|^2 = \operatorname{tr}\mathbf{I} = p$. The estimator proposed by James and Stein
is (essentially)

(2)
$$
\mathbf{m}(\mathbf{y}) = \Bigl(1 - \frac{p-2}{\|\mathbf{y} - \boldsymbol{\nu}\|^2}\Bigr)(\mathbf{y} - \boldsymbol{\nu}) + \boldsymbol{\nu},
$$

where $\boldsymbol{\nu}$ is an arbitrary fixed vector and $p \ge 3$. This estimator shrinks the
observed $\mathbf{y}$ toward the specified $\boldsymbol{\nu}$. The amount of shrinkage is negligible if $\mathbf{y}$
is very different from $\boldsymbol{\nu}$ and is considerable if $\mathbf{y}$ is close to $\boldsymbol{\nu}$. In this sense $\boldsymbol{\nu}$
is a favored point.

Theorem 3.5.1. With respect to the loss function (1), the risk of the
estimator (2) is less than the risk of the estimator $\mathbf{Y}$ for $p \ge 3$.


We shall show that the risk of Y minus the risk of (2) is positive by 
applying the following lemma due to Stein (1974). 


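Lemma 3.5.1. If $f(x)$ is a function such that

(3)  $f(b) - f(a) = \int_a^b f'(x)\,dx$

for all $a$ and $b$ ($a < b$) and if

(4)  $\int_{-\infty}^{\infty} |f'(x)|\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-\theta)^2}\,dx < \infty,$

then

(5)  $\int_{-\infty}^{\infty} f(x)(x-\theta)\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-\theta)^2}\,dx = \int_{-\infty}^{\infty} f'(x)\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-\theta)^2}\,dx.$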




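Proof of Lemma. We write the left-hand side of (5) as

(6)  $\int_{\theta}^{\infty} [f(x)-f(\theta)](x-\theta)\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-\theta)^2}\,dx + \int_{-\infty}^{\theta} [f(x)-f(\theta)](x-\theta)\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-\theta)^2}\,dx$

     $= \int_{\theta}^{\infty}\int_{\theta}^{x} f'(y)(x-\theta)\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-\theta)^2}\,dy\,dx - \int_{-\infty}^{\theta}\int_{x}^{\theta} f'(y)(x-\theta)\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-\theta)^2}\,dy\,dx$

     $= \int_{\theta}^{\infty}\int_{y}^{\infty} f'(y)(x-\theta)\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-\theta)^2}\,dx\,dy - \int_{-\infty}^{\theta}\int_{-\infty}^{y} f'(y)(x-\theta)\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-\theta)^2}\,dx\,dy,$

which yields the right-hand side of (5), since
$\int_{y}^{\infty}(x-\theta)\,e^{-\frac{1}{2}(x-\theta)^2}\,dx = -\int_{-\infty}^{y}(x-\theta)\,e^{-\frac{1}{2}(x-\theta)^2}\,dx = e^{-\frac{1}{2}(y-\theta)^2}$.
Fubini's theorem justifies the interchange of order of integration. (See Problem 3.22.)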


The lemma can also be derived by integration by parts in special cases. 
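
The identity (5) can be checked numerically for a particular smooth $f$. The sketch below is only such a check under stated assumptions: it uses a composite trapezoidal rule on a truncated interval, and the names `normal_pdf`, `integrate`, and `stein_sides`, the truncation half-width, and the grid size are choices made for illustration.

```python
import math


def normal_pdf(x, theta):
    """Density of N(theta, 1) at x."""
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2.0 * math.pi)


def integrate(h, lo, hi, n=20000):
    """Composite trapezoidal rule for h on [lo, hi] with n subintervals."""
    step = (hi - lo) / n
    total = 0.5 * (h(lo) + h(hi))
    for k in range(1, n):
        total += h(lo + k * step)
    return total * step


def stein_sides(f, fprime, theta, half_width=12.0):
    """Return numerical values of the left- and right-hand sides of (5)."""
    lo, hi = theta - half_width, theta + half_width
    lhs = integrate(lambda x: f(x) * (x - theta) * normal_pdf(x, theta), lo, hi)
    rhs = integrate(lambda x: fprime(x) * normal_pdf(x, theta), lo, hi)
    return lhs, rhs
```

With $f(x) = x^3$ and $\theta = 1$, both sides equal $3\,\mathcal{E}X^2 = 3(1 + \theta^2) = 6$, and `stein_sides(lambda x: x ** 3, lambda x: 3 * x * x, 1.0)` should agree with that value up to quadrature error.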


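Proof of Theorem 3.5.1. The difference in risks is

(7)  $\Delta R(\mu) = \mathcal{E}_\mu\{\|Y-\mu\|^2 - \|m(Y)-\mu\|^2\}$

     $= \mathcal{E}_\mu\left\{\|Y-\mu\|^2 - \left\|(Y-\mu) - \frac{p-2}{\|Y-\nu\|^2}(Y-\nu)\right\|^2\right\}$

     $= \mathcal{E}_\mu\left\{2\,\frac{p-2}{\|Y-\nu\|^2}\sum_{i=1}^{p}(Y_i-\mu_i)(Y_i-\nu_i) - \frac{(p-2)^2}{\|Y-\nu\|^2}\right\}.$

Now we use Lemma 3.5.1 with

(8)  $f(y_i) = \frac{y_i-\nu_i}{\sum_{j=1}^{p}(y_j-\nu_j)^2}, \qquad f'(y_i) = \frac{1}{\sum_{j=1}^{p}(y_j-\nu_j)^2} - \frac{2(y_i-\nu_i)^2}{\left[\sum_{j=1}^{p}(y_j-\nu_j)^2\right]^2}.$

[For $p \ge 3$ the condition (4) is satisfied.] Then (7) is

(9)  $\Delta R(\mu) = \mathcal{E}_\mu\left\{2(p-2)\sum_{i=1}^{p}\left[\frac{1}{\|Y-\nu\|^2} - \frac{2(Y_i-\nu_i)^2}{\|Y-\nu\|^4}\right] - \frac{(p-2)^2}{\|Y-\nu\|^2}\right\}$

     $= (p-2)^2\,\mathcal{E}_\mu\,\frac{1}{\|Y-\nu\|^2} > 0.$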


This theorem states that $Y$ is inadmissible for estimating $\mu$ when $p \ge 3$,
since the estimator (2) has a smaller risk for every $\mu$ (regardless of the choice
of $\nu$).

The risk is the sum of the mean squared errors $\mathcal{E}[m_i(Y) - \mu_i]^2$. Since
$Y_1,\ldots,Y_p$ are independent and only the distribution of $Y_i$ depends on $\mu_i$, it is
puzzling that the improved estimator uses all the $Y_j$'s to estimate $\mu_i$; it seems
that irrelevant information is being used. Stein explained the phenomenon by
arguing that the sample distance squared of $Y$ from $\nu$, that is, $\|Y-\nu\|^2$,
overestimates the squared distance of $\mu$ from $\nu$ and hence that the estimator
$Y$ could be improved by bringing it nearer $\nu$ (whatever $\nu$ is). Berger (1980a),
following Brown, illustrated by Figure 3.4. The four points $x_1, x_2, x_3, x_4$
represent a spherical distribution centered at $\mu$. Consider the effects of
shrinkage. The average distance of $m(x_1)$ and $m(x_3)$ from $\mu$ is a little greater
than that of $x_1$ and $x_3$, but $m(x_2)$ and $m(x_4)$ are a little closer to $\mu$ than $x_2$
and $x_4$ are if the shrinkage is a certain amount. If $p = 3$, there are two more
points (not on the line $\nu$, $\mu$) that are shrunk closer to $\mu$.


Figure 3.4. Effect of shrinkage.




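The risk of the estimator (2) is

(10)  $\mathcal{E}_\mu\|m(Y)-\mu\|^2 = p - (p-2)^2\,\mathcal{E}_\mu\,\frac{1}{\|Y-\nu\|^2},$

where $\|Y-\nu\|^2$ has a noncentral $\chi^2$-distribution with $p$ degrees of freedom
and noncentrality parameter $\|\mu-\nu\|^2$. The farther $\mu$ is from $\nu$, the less the
improvement due to the James-Stein estimator, but there is always some
improvement. The density of $\|Y-\nu\|^2 = V$, say, is (28) of Section 3.3.3, where
$\tau^2 = \|\mu-\nu\|^2$. Then

(11)  $\mathcal{E}_\mu\,\frac{1}{\|Y-\nu\|^2} = \mathcal{E}_\mu V^{-1}$

      $= e^{-\frac{1}{2}\tau^2}\sum_{\beta=0}^{\infty}\frac{(\tau^2/2)^\beta}{\beta!\,2^{\frac{1}{2}p+\beta}\,\Gamma(\frac{1}{2}p+\beta)}\int_0^\infty v^{\frac{1}{2}p+\beta-2}\,e^{-\frac{1}{2}v}\,dv$

      $= e^{-\frac{1}{2}\tau^2}\sum_{\beta=0}^{\infty}\frac{(\tau^2/2)^\beta}{\beta!}\,\frac{2^{\frac{1}{2}p+\beta-1}\,\Gamma(\frac{1}{2}p+\beta-1)}{2^{\frac{1}{2}p+\beta}\,\Gamma(\frac{1}{2}p+\beta)}$

      $= \frac{1}{2}\,e^{-\frac{1}{2}\tau^2}\sum_{\beta=0}^{\infty}\frac{(\tau^2/2)^\beta}{\beta!\,(\frac{1}{2}p+\beta-1)}$

for $p \ge 3$. Note that for $\mu = \nu$, that is, $\tau^2 = 0$, (11) is $1/(p-2)$ and the mean
squared error (10) is 2. For large $p$ the reduction in risk is considerable.

Table 3.2 gives values of the risk for $p = 10$ and $\sigma^2 = 1$. For example, if
$\tau^2 = \|\mu-\nu\|^2$ is 5, the mean squared error of the James-Stein estimator is
8.86, compared to 10 for the natural estimator; this is the case if
$\mu_i - \nu_i = 1/\sqrt{2} = 0.707$, $i = 1,\ldots,10$, for instance.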


Table 3.2†. Average Mean Squared Error of the
James-Stein Estimator for $p = 10$ and $\sigma^2 = 1$


$\tau^2 = \|\mu-\nu\|^2$     $\mathcal{E}_\mu\|m(Y)-\mu\|^2$
0.0                          2.00
0.5                          4.78
1.0                          6.21
2.0                          7.51
3.0                          8.24
4.0                          8.62
5.0                          8.86
6.0                          9.03


†From Efron and Morris (1977).
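
The risk (10) can be evaluated directly from the series (11). The following sketch does so with the standard library only; the names `expected_inverse_v` and `james_stein_risk` and the truncation of the series at a fixed number of terms are assumptions made for illustration. Under these assumptions the code gives exactly 2 at $\tau^2 = 0$, about 4.78 at $\tau^2 = 5$, and about 8.86 at $\tau^2 = 50$. Read this way, the tabulated values appear to be reproduced when the first column of Table 3.2 is taken as $\tau^2/p$ rather than $\tau^2$; that reading is an inference from this computation, not a statement in the source.

```python
import math


def expected_inverse_v(p, tau2, terms=400):
    """Series (11): E[1/V] for V noncentral chi-square (p d.f., noncentrality tau2)."""
    if p < 3:
        raise ValueError("the series (11) requires p >= 3")
    half = tau2 / 2.0
    total = 0.0
    weight = 1.0  # holds (tau2/2)^beta / beta!, updated recursively
    for beta in range(terms):
        total += weight / (p / 2.0 + beta - 1.0)
        weight *= half / (beta + 1)
    return 0.5 * math.exp(-half) * total


def james_stein_risk(p, tau2):
    """Risk (10) of the estimator (2)."""
    return p - (p - 2) ** 2 * expected_inverse_v(p, tau2)
```

The fixed truncation is adequate for the moderate values of $\tau^2$ used here; much larger noncentrality would need more terms and some care with underflow of $e^{-\tau^2/2}$.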




An obvious question in using an estimator of this class is how to choose
the vector $\nu$ toward which the observed mean vector is shrunk; any $\nu$ yields
an estimator better than the natural one. However, as seen from Table 3.2,
the improvement is small if $\|\mu-\nu\|$ is very large. Thus, to be effective some
knowledge of the position of $\mu$ is necessary. A disadvantage of the procedure
is that it is not objective; the choice of $\nu$ is up to the investigator.

A feature of the estimator we have been studying that seems disadvantageous
is that for small values of $\|Y-\nu\|$, the multiplier of $Y-\nu$ is negative;
that is, the estimator $m(Y)$ is in the direction from $\nu$ opposite to that of $Y$.
This disadvantage can be overcome and the estimator improved by replacing
the factor by 0 when the factor is negative.


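Definition 3.5.1. For any function $g(u)$, let

(12)  $g^+(u) = g(u)$ if $g(u) \ge 0$, and $g^+(u) = 0$ if $g(u) < 0$.

Lemma 3.5.2. When $X$ is distributed according to $N(\mu, I)$,

(13)  $\mathcal{E}\{\|g^+(\|X\|)X - \mu\|^2\} \le \mathcal{E}\{\|g(\|X\|)X - \mu\|^2\}.$

Proof. The right-hand side of (13) minus the left-hand side is

(14)  $\mathcal{E}\{g^2(\|X\|)\|X\|^2 - [g^+(\|X\|)]^2\|X\|^2\} \ge 0$

plus 2 times

(15)  $\mathcal{E}\,\mu'X\{g^+(\|X\|) - g(\|X\|)\}$

      $= \|\mu\|\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} y_1[g^+(\|y\|) - g(\|y\|)]\,\frac{1}{(2\pi)^{\frac{1}{2}p}}\exp\left\{-\frac{1}{2}\left[\sum_{i=1}^{p} y_i^2 - 2y_1\|\mu\| + \|\mu\|^2\right]\right\}dy_1\cdots dy_p,$

where $y' = x'P$, $(\|\mu\|, 0, \ldots, 0) = \mu'P$, and $PP' = I$. [The first column of $P$ is
$(1/\|\mu\|)\mu$.] Then (15) is $\|\mu\|$ times

(16)  $e^{-\frac{1}{2}\|\mu\|^2}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\int_{0}^{\infty} y_1[g^+(\|y\|) - g(\|y\|)]\left[e^{\|\mu\|y_1} - e^{-\|\mu\|y_1}\right]\frac{1}{(2\pi)^{\frac{1}{2}p}}\,e^{-\frac{1}{2}\sum_{i=1}^{p} y_i^2}\,dy_1\,dy_2\cdots dy_p \ge 0$

(by replacing $y_1$ by $-y_1$ for $y_1 < 0$).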




Theorem 3.5.2. The estimator

(17)  $m^+(y) = \left(1 - \frac{p-2}{\|y-\nu\|^2}\right)^{+}(y-\nu) + \nu$

has smaller risk than $m(y)$ defined by (2) and is minimax.


Proof. In Lemma 3.5.2, let $g(u) = 1 - (p-2)/u^2$ and $X = Y - \nu$, and
replace $\mu$ by $\mu - \nu$. The second assertion in the theorem follows from
Theorem 3.4.6.


The theorem shows that $m(Y)$ is not admissible. However, it is known that
$m^+(Y)$ is also not admissible, but it is believed that not much further
improvement is possible.
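
A minimal sketch of the positive-part rule (17), in the same style as the definition (12): the shrinkage factor is replaced by 0 whenever it is negative. The function name `positive_part_james_stein` is hypothetical, and returning $\nu$ itself when $y = \nu$ exactly is a convention adopted here (the factor is then $-\infty$, whose positive part is 0), not something stated in the text.

```python
def positive_part_james_stein(y, nu):
    """Estimator (17): the factor 1 - (p - 2)/||y - nu||^2 is truncated at 0."""
    p = len(y)
    if p < 3:
        raise ValueError("the estimator (17) is defined here only for p >= 3")
    dist2 = sum((yi - ni) ** 2 for yi, ni in zip(y, nu))
    if dist2 == 0.0:
        # Convention for y == nu: the untruncated factor is -infinity, so use 0.
        return list(nu)
    factor = max(0.0, 1.0 - (p - 2) / dist2)
    return [factor * (yi - ni) + ni for yi, ni in zip(y, nu)]
```

When $\|y-\nu\|^2 \ge p-2$ the result coincides with (2); when $\|y-\nu\|^2 < p-2$ the estimate is $\nu$ instead of a point on the far side of $\nu$.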

This approach is easily extended to the case where one observes $x_1,\ldots,x_N$
from $N(\mu, \Sigma)$ with loss function $L(\mu, m) = (m-\mu)'\Sigma^{-1}(m-\mu)$. Let
$\Sigma = CC'$ for some nonsingular $C$, $x_\alpha = Cx_\alpha^*$, $\alpha = 1,\ldots,N$, $\mu = C\mu^*$, and
$L^*(m^*, \mu^*) = \|m^*-\mu^*\|^2$. Then $x_1^*,\ldots,x_N^*$ are observations from $N(\mu^*, I)$,
and the problem is reduced to the earlier one. Then

(18)  $\left(1 - \frac{p-2}{N(\bar{x}-\nu)'\Sigma^{-1}(\bar{x}-\nu)}\right)^{+}(\bar{x}-\nu) + \nu$

is a minimax estimator of $\mu$.


3.5.3. Estimation for a General Known Covariance Matrix and an Arbitrary Quadratic Loss Function


Let the parent distribution be $N(\mu, \Sigma)$, where $\Sigma$ is known, and let the loss
function be

(19)  $L(\mu, m) = (m-\mu)'Q(m-\mu),$

where $Q$ is an arbitrary positive definite matrix which reflects the relative
importance of errors in different directions. (If the loss function were
singular, the dimensionality of $x$ could be reduced so as to make the loss
matrix nonsingular.) Then the sample mean $\bar{x}$ has the distribution
$N(\mu, (1/N)\Sigma)$ and risk (expected loss)

(20)  $\mathcal{E}(\bar{x}-\mu)'Q(\bar{x}-\mu) = \mathcal{E}\operatorname{tr} Q(\bar{x}-\mu)(\bar{x}-\mu)' = \frac{1}{N}\operatorname{tr} Q\Sigma,$

which is constant, not depending on $\mu$.




Several estimators that improve on $\bar{x}$ have been proposed. First we take
up an estimator proposed independently by Berger (1975) and Hudson
(1974).


Theorem 3.5.3. Let $r(z)$, $0 \le z < \infty$, be a nondecreasing differentiable function
such that $0 \le r(z) \le 2(p-2)$. Then for $p \ge 3$

(21)  $m = \left[I - \frac{r\left(N^2(\bar{x}-\nu)'\Sigma^{-1}Q^{-1}\Sigma^{-1}(\bar{x}-\nu)\right)}{N(\bar{x}-\nu)'\Sigma^{-1}Q^{-1}\Sigma^{-1}(\bar{x}-\nu)}\,Q^{-1}\Sigma^{-1}\right](\bar{x}-\nu) + \nu$

has smaller risk than $\bar{x}$ and is minimax.


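Proof. There exists a matrix $C$ such that $C'QC = I$ and $(1/N)\Sigma = C\Delta C'$
where $\Delta$ is diagonal with diagonal elements $\delta_1 \ge \delta_2 \ge \cdots \ge \delta_p > 0$ (Theorem
A.2.2 of the Appendix). Let $\bar{x} = Cy + \nu$ and $\mu = C\mu^* + \nu$. Then $y$ has the
distribution $N(\mu^*, \Delta)$, and the transformed loss function is

(22)  $L^*(m^*, \mu^*) = (m^*-\mu^*)'(m^*-\mu^*) = \|m^*-\mu^*\|^2.$

The estimator (21) of $\mu$ is transformed to the estimator of $\mu^* = C^{-1}(\mu-\nu)$,

(23)  $m^*(y) = \left[I - \frac{r(y'\Delta^{-2}y)}{y'\Delta^{-2}y}\,\Delta^{-1}\right]y.$

We now proceed as in the proof of Theorem 3.5.1. The difference in risks
between $y$ and $m^*$ is

(24)  $\Delta R(\mu^*) = \mathcal{E}_{\mu^*}\{\|Y-\mu^*\|^2 - \|m^*(Y)-\mu^*\|^2\}$

      $= \mathcal{E}_{\mu^*}\left\{2\,\frac{r(Y'\Delta^{-2}Y)}{Y'\Delta^{-2}Y}\sum_{i=1}^{p}\frac{(Y_i-\mu_i^*)Y_i}{\delta_i} - \frac{r^2(Y'\Delta^{-2}Y)}{Y'\Delta^{-2}Y}\right\}.$

Since $r(z)$ is differentiable, we use Lemma 3.5.1 with $(x-\theta) = (y_i-\mu_i^*)/\sqrt{\delta_i}$
and

(25)  $f(y_i) = \frac{r(y'\Delta^{-2}y)}{y'\Delta^{-2}y}\,y_i,$

(26)  $f'(y_i) = \frac{r(y'\Delta^{-2}y)}{y'\Delta^{-2}y} + \frac{2r'(y'\Delta^{-2}y)}{y'\Delta^{-2}y}\,\frac{y_i^2}{\delta_i^2} - \frac{2r(y'\Delta^{-2}y)}{(y'\Delta^{-2}y)^2}\,\frac{y_i^2}{\delta_i^2}.$

Then

(27)  $\Delta R(\mu^*) = \mathcal{E}_{\mu^*}\left\{2(p-2)\,\frac{r(Y'\Delta^{-2}Y)}{Y'\Delta^{-2}Y} + 4r'(Y'\Delta^{-2}Y) - \frac{r^2(Y'\Delta^{-2}Y)}{Y'\Delta^{-2}Y}\right\} \ge 0,$

since $r(y'\Delta^{-2}y) \le 2(p-2)$ and $r'(y'\Delta^{-2}y) \ge 0$.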


Corollary 3.5.1. For $p \ge 3$

(28)  $\left[I - \frac{\min\left[p-2,\;N^2(\bar{x}-\nu)'\Sigma^{-1}Q^{-1}\Sigma^{-1}(\bar{x}-\nu)\right]}{N(\bar{x}-\nu)'\Sigma^{-1}Q^{-1}\Sigma^{-1}(\bar{x}-\nu)}\,Q^{-1}\Sigma^{-1}\right](\bar{x}-\nu) + \nu$

has smaller risk than $\bar{x}$ and is minimax.


Proof. The function $r(z) = \min(p-2, z)$ is differentiable except at
$z = p-2$. The function $r(z)$ can be approximated arbitrarily closely by a
differentiable function. (For example, the corner at $z = p-2$ can be smoothed by
a circular arc of arbitrarily small radius.) We shall not give the details of the
proof.


In canonical form y is shrunk by a scalar times a diagonal matrix. The 
larger the variance of a component is, the less the effect of the shrinkage. 

Berger (1975) has proved these results for a more general density, that is,
for a mixture of normals. Berger (1976) has also proved in the case of
normality that if

(29)  $r(z) = \frac{z\int_0^{\alpha} u^{\frac{1}{2}p-c+1}\,e^{-\frac{1}{2}uz}\,du}{\int_0^{\alpha} u^{\frac{1}{2}p-c}\,e^{-\frac{1}{2}uz}\,du}$

for $3 - \frac{1}{2}p \le c < 1 + \frac{1}{2}p$, where $\alpha$ is the smallest characteristic root of $\Sigma Q$,
then the estimator $m$ given by (21) is minimax, is admissible if $c < 2$, and is
proper Bayes if $c < 1$.

Another approach to minimax estimators has been introduced by Bhattacharya
(1966). Let $C$ be such that $C^{-1}(1/N)\Sigma(C^{-1})' = I$ and $C'QC = Q^*$,
which is diagonal with diagonal elements $q_1^* \ge q_2^* \ge \cdots \ge q_p^* > 0$. Then
$y = C^{-1}\bar{x}$ has the distribution $N(\mu^*, I)$, and the loss function is

(30)  $L^*(m^*, \mu^*) = \sum_{i=1}^{p} q_i^*(m_i^*-\mu_i^*)^2$

      $= \sum_{i=1}^{p}\sum_{j=i}^{p}\alpha_j(m_i^*-\mu_i^*)^2$

      $= \sum_{j=1}^{p}\alpha_j\sum_{i=1}^{j}(m_i^*-\mu_i^*)^2$

      $= \sum_{j=1}^{p}\alpha_j\,\|m^{*(j)}-\mu^{*(j)}\|^2,$

where $\alpha_j = q_j^* - q_{j+1}^*$, $j = 1,\ldots,p-1$, $\alpha_p = q_p^*$, $m^{*(j)} = (m_1^*,\ldots,m_j^*)'$, and
$\mu^{*(j)} = (\mu_1^*,\ldots,\mu_j^*)'$, $j = 1,\ldots,p$. This decomposition of the loss function
suggests combining minimax estimators of the vectors $\mu^{*(j)}$, $j = 1,\ldots,p$. Let
$y^{(j)} = (y_1,\ldots,y_j)'$.


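Theorem 3.5.4. If $h^{(j)}(y^{(j)}) = [h_1^{(j)}(y^{(j)}),\ldots,h_j^{(j)}(y^{(j)})]'$ is a minimax
estimator of $\mu^{*(j)}$ under the loss function $\|m^{*(j)}-\mu^{*(j)}\|^2$, $j = 1,\ldots,p$, then

(31)  $\frac{1}{q_i^*}\sum_{j=i}^{p}\alpha_j h_i^{(j)}(y^{(j)}), \qquad i = 1,\ldots,p,$

is a minimax estimator of $\mu_1^*,\ldots,\mu_p^*$.
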
Proof. First consider the randomized estimator defined by

(32)  $\Pr\{G_i(y) = h_i^{(j)}(y^{(j)})\} = \frac{\alpha_j}{q_i^*}, \qquad j = i,\ldots,p,$

for the $i$th component. Then the risk of this estimator is

(33)  $\sum_{i=1}^{p} q_i^*\,\mathcal{E}_{\mu^*}[G_i(Y)-\mu_i^*]^2 = \sum_{i=1}^{p} q_i^*\sum_{j=i}^{p}\frac{\alpha_j}{q_i^*}\,\mathcal{E}_{\mu^*}[h_i^{(j)}(Y^{(j)})-\mu_i^*]^2$

      $= \sum_{j=1}^{p}\alpha_j\sum_{i=1}^{j}\mathcal{E}_{\mu^*}[h_i^{(j)}(Y^{(j)})-\mu_i^*]^2$

      $= \sum_{j=1}^{p}\alpha_j\,\mathcal{E}_{\mu^*}\|h^{(j)}(Y^{(j)})-\mu^{*(j)}\|^2$

      $\le \sum_{j=1}^{p}\alpha_j\,j = \sum_{j=1}^{p} q_j^*$

      $= \mathcal{E}_{\mu^*}L^*(Y, \mu^*),$

and hence the estimator defined by (32) is minimax.

Since the expected value of $G_i(Y)$ with respect to (32) is (31) and the loss
function is convex, the risk of the estimator (31) is less than that of the
randomized estimator (by Jensen's inequality).


3.6. ELLIPTICALLY CONTOURED DISTRIBUTIONS 


3.6.1. Observations Elliptically Contoured 


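Let $x_1,\ldots,x_N$ be $N$ $(= n+1)$ independent observations on a random vector
$X$ with density $|\Lambda|^{-\frac{1}{2}}g[(x-\nu)'\Lambda^{-1}(x-\nu)]$. The density of the sample is

(1)  $|\Lambda|^{-\frac{1}{2}N}\prod_{\alpha=1}^{N} g[(x_\alpha-\nu)'\Lambda^{-1}(x_\alpha-\nu)].$

The sample mean $\bar{x}$ and covariance matrix
$S = (1/n)\left[\sum_{\alpha=1}^{N}(x_\alpha-\mu)(x_\alpha-\mu)' - N(\bar{x}-\mu)(\bar{x}-\mu)'\right]$
are unbiased estimators of the mean $\mu = \nu$ and the
covariance matrix $\Sigma = [\mathcal{E}R^2/p]\Lambda$, where $R^2 = (x-\nu)'\Lambda^{-1}(x-\nu)$.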


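Theorem 3.6.1. The covariances of the mean and covariance of a sample of
$N$ from $|\Lambda|^{-\frac{1}{2}}g[(x-\nu)'\Lambda^{-1}(x-\nu)]$ with $\mathcal{E}R^4 < \infty$ are

(2)  $\mathcal{E}(\bar{x}-\mu)(\bar{x}-\mu)' = \frac{1}{N}\Sigma,$

(3)  $\mathcal{E}(s_{ij}-\sigma_{ij})(\bar{x}-\mu) = 0, \qquad i,j = 1,\ldots,p,$

(4)  $\mathcal{E}(s_{ij}-\sigma_{ij})(s_{kl}-\sigma_{kl}) = \frac{\kappa}{N}(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}) + \frac{1}{n}(\sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}), \qquad i,j,k,l = 1,\ldots,p.$
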
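Lemma 3.6.1. The second-order moments of the elements of $S$ are

(5)  $\mathcal{E}s_{ij}s_{kl} = \sigma_{ij}\sigma_{kl} + \frac{1}{n}(\sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}) + \frac{\kappa}{N}(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}), \qquad i,j,k,l = 1,\ldots,p.$

Proof of Lemma 3.6.1. We have

(6)  $\mathcal{E}\sum_{\alpha,\beta=1}^{N}(x_{i\alpha}-\mu_i)(x_{j\alpha}-\mu_j)(x_{k\beta}-\mu_k)(x_{l\beta}-\mu_l)$

     $= N\,\mathcal{E}(x_{i\alpha}-\mu_i)(x_{j\alpha}-\mu_j)(x_{k\alpha}-\mu_k)(x_{l\alpha}-\mu_l) + N(N-1)\,\mathcal{E}(x_{i\alpha}-\mu_i)(x_{j\alpha}-\mu_j)\,\mathcal{E}(x_{k\beta}-\mu_k)(x_{l\beta}-\mu_l)$

     $= N(1+\kappa)(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}) + N(N-1)\sigma_{ij}\sigma_{kl},$

(7)  $\mathcal{E}N^2(\bar{x}_i-\mu_i)(\bar{x}_j-\mu_j)(\bar{x}_k-\mu_k)(\bar{x}_l-\mu_l)$

     $= \frac{1}{N^2}\,\mathcal{E}\sum_{\alpha,\beta,\gamma,\delta=1}^{N}(x_{i\alpha}-\mu_i)(x_{j\beta}-\mu_j)(x_{k\gamma}-\mu_k)(x_{l\delta}-\mu_l)$

     $= \frac{1}{N}(1+\kappa)(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}) + \frac{N-1}{N}(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}),$

(8)  $\mathcal{E}\sum_{\alpha=1}^{N}(x_{i\alpha}-\mu_i)(x_{j\alpha}-\mu_j)\,\frac{1}{N}\sum_{\beta,\gamma=1}^{N}(x_{k\beta}-\mu_k)(x_{l\gamma}-\mu_l)$

     $= (1+\kappa)(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}) + (N-1)\sigma_{ij}\sigma_{kl}.$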


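It will be convenient to use more matrix algebra. Define vec $B$, $B \otimes C$ (the
Kronecker product), and $K_{mn}$ (the commutator matrix) by

(9)  $\operatorname{vec} B = \operatorname{vec}(b_1,\ldots,b_n) = \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix},$

(10)  $B \otimes C = \begin{pmatrix} b_{11}C & \cdots & b_{1n}C \\ \vdots & & \vdots \\ b_{m1}C & \cdots & b_{mn}C \end{pmatrix},$

(11)  $K_{mn}\operatorname{vec} B = \operatorname{vec} B'.$

See, e.g., Magnus and Neudecker (1979) or Section A.5 of the Appendix. We
can rewrite (4) as

(12)  $\mathcal{C}(\operatorname{vec} S) = \mathcal{E}(\operatorname{vec} S - \operatorname{vec}\Sigma)(\operatorname{vec} S - \operatorname{vec}\Sigma)'$

      $= \frac{1}{n}(I_{p^2} + K_{pp})(\Sigma \otimes \Sigma) + \frac{\kappa}{N}\left[(I_{p^2} + K_{pp})(\Sigma \otimes \Sigma) + \operatorname{vec}\Sigma\,(\operatorname{vec}\Sigma)'\right].$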
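
The operations in (9) to (11) can be made concrete with a small sketch on matrices stored as lists of rows. All function names below (`vec`, `transpose`, `kron`, `commutation_matrix`, `matvec`, `matmul`) are illustrative assumptions; vec stacks columns, as in (9).

```python
def vec(b):
    """Stack the columns of b (a list of rows) into one list, as in (9)."""
    rows, cols = len(b), len(b[0])
    return [b[i][j] for j in range(cols) for i in range(rows)]


def transpose(b):
    return [list(col) for col in zip(*b)]


def kron(b, c):
    """Kronecker product (10): the (i, j) block of the result is b[i][j] * c."""
    return [[bij * ckl for bij in brow for ckl in crow]
            for brow in b for crow in c]


def commutation_matrix(m, n):
    """K_{mn} such that K_{mn} vec(B) = vec(B') for every m x n matrix B, as in (11)."""
    k = [[0] * (m * n) for _ in range(m * n)]
    for i in range(m):
        for j in range(n):
            # b[i][j] is entry j*m + i of vec(B) and entry i*n + j of vec(B').
            k[i * n + j][j * m + i] = 1
    return k


def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]
```

For a $2 \times 3$ matrix $B$, `matvec(commutation_matrix(2, 3), vec(B))` reproduces `vec(transpose(B))`. The same helpers can be used to check the identity vec $FGH = (H' \otimes F)$ vec $G$ on small examples.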


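Theorem 3.6.2

(13)  $\sqrt{N}\begin{pmatrix}\bar{x}-\mu \\ \operatorname{vec} S - \operatorname{vec}\Sigma\end{pmatrix} \xrightarrow{d} N\left[\begin{pmatrix}0 \\ 0\end{pmatrix}, \begin{pmatrix}\Sigma & 0 \\ 0 & (\kappa+1)(I_{p^2} + K_{pp})(\Sigma \otimes \Sigma) + \kappa\operatorname{vec}\Sigma\,(\operatorname{vec}\Sigma)'\end{pmatrix}\right].$


This theorem follows from the central limit theorem for independent
identically distributed random vectors (with finite fourth moments). The
theorem forms the basis for large-sample inference.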


3.6.2. Estimation of the Kurtosis Parameter


To apply the large-sample distribution theory derived for normal distributions
to problems of inference for elliptically contoured distributions it is
necessary to know or estimate the kurtosis parameter $\kappa$. Note that

(14)  $\mathcal{E}\{[(x-\mu)'\Sigma^{-1}(x-\mu)]^2\} = \frac{p^2\,\mathcal{E}R^4}{(\mathcal{E}R^2)^2} = p(p+2)(1+\kappa).$

Since $\bar{x} \xrightarrow{p} \mu$ and $S \xrightarrow{p} \Sigma$,

(15)  $\frac{1}{N}\sum_{\alpha=1}^{N}[(x_\alpha-\bar{x})'S^{-1}(x_\alpha-\bar{x})]^2 \xrightarrow{p} p(p+2)(1+\kappa).$

A consistent estimator of $\kappa$ is

(16)  $\hat\kappa = \frac{1}{p(p+2)}\,\frac{1}{N}\sum_{\alpha=1}^{N}[(x_\alpha-\bar{x})'S^{-1}(x_\alpha-\bar{x})]^2 - 1.$

Mardia (1970) proposed using M to form a consistent estimator of $\kappa$.
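
The estimator (16) is straightforward to compute once $\bar{x}$ and $S$ are available. The sketch below uses only the standard library; the helper `solve` (Gaussian elimination with partial pivoting) and the function name `kurtosis_estimate` are assumptions introduced for illustration, and $S$ is computed with divisor $n = N-1$ as in Section 3.6.1.

```python
def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z


def kurtosis_estimate(xs):
    """Estimator (16) of kappa from a list of p-component observations."""
    n_obs, p = len(xs), len(xs[0])
    mean = [sum(x[i] for x in xs) / n_obs for i in range(p)]
    dev = [[x[i] - mean[i] for i in range(p)] for x in xs]
    s = [[sum(d[i] * d[j] for d in dev) / (n_obs - 1) for j in range(p)]
         for i in range(p)]
    total = 0.0
    for d in dev:
        z = solve(s, d)                      # z = S^{-1} (x - xbar)
        q = sum(di * zi for di, zi in zip(d, z))
        total += q * q
    return total / (n_obs * p * (p + 2)) - 1.0
```

For a large sample from a normal distribution the returned value should be close to 0, since $\kappa = 0$ in that case.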


3.6.3. Maximum Likelihood Estimation


We have considered using $S$ as an estimator of $\Sigma = (\mathcal{E}R^2/p)\Lambda$. When the
parent distribution is normal, $S$ is the sufficient statistic invariant with
respect to translations and hence is the efficient unbiased estimator. Now we
study other estimators.

We consider first the maximum likelihood estimators of $\mu$ and $\Lambda$ when
the form of the density $g(\cdot)$ is known. The logarithm of the likelihood
function is

(17)  $\log L = -\frac{N}{2}\log|\Lambda| + \sum_{\alpha=1}^{N}\log g[(x_\alpha-\mu)'\Lambda^{-1}(x_\alpha-\mu)].$




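The derivatives of $\log L$ with respect to the components of $\mu$ are

(18)  $\frac{\partial\log L}{\partial\mu} = -2\sum_{\alpha=1}^{N}\frac{g'[(x_\alpha-\mu)'\Lambda^{-1}(x_\alpha-\mu)]}{g[(x_\alpha-\mu)'\Lambda^{-1}(x_\alpha-\mu)]}\,\Lambda^{-1}(x_\alpha-\mu).$

Setting the vector of derivatives equal to 0 leads to the equation

(19)  $\sum_{\alpha=1}^{N}\frac{g'[(x_\alpha-\hat\mu)'\hat\Lambda^{-1}(x_\alpha-\hat\mu)]}{g[(x_\alpha-\hat\mu)'\hat\Lambda^{-1}(x_\alpha-\hat\mu)]}\,x_\alpha = \hat\mu\sum_{\alpha=1}^{N}\frac{g'[(x_\alpha-\hat\mu)'\hat\Lambda^{-1}(x_\alpha-\hat\mu)]}{g[(x_\alpha-\hat\mu)'\hat\Lambda^{-1}(x_\alpha-\hat\mu)]}.$

Setting equal to 0 the derivatives of $\log L$ with respect to the elements of
$\Lambda^{-1}$ gives

(20)  $\hat\Lambda = -\frac{2}{N}\sum_{\alpha=1}^{N}\frac{g'[(x_\alpha-\hat\mu)'\hat\Lambda^{-1}(x_\alpha-\hat\mu)]}{g[(x_\alpha-\hat\mu)'\hat\Lambda^{-1}(x_\alpha-\hat\mu)]}\,(x_\alpha-\hat\mu)(x_\alpha-\hat\mu)'.$

The estimator $\hat\Lambda$ is a kind of weighted average of the rank 1 matrices
$(x_\alpha-\hat\mu)(x_\alpha-\hat\mu)'$. In the normal case the weights are $1/N$. In most cases
(19) and (20) cannot be solved explicitly, but the solution may be approximated
by iterative methods.

The covariance matrix of the limiting normal distribution of
$\sqrt{N}(\operatorname{vec}\hat\Lambda - \operatorname{vec}\Lambda)$ is

(21)  $\mathcal{C}(\operatorname{vec}\hat\Lambda) = \sigma_{1g}(I_{p^2} + K_{pp})(\Lambda \otimes \Lambda) + \sigma_{2g}\operatorname{vec}\Lambda\,(\operatorname{vec}\Lambda)',$

where

(22)  $\sigma_{1g} = \frac{p(p+2)}{4\,\mathcal{E}\left[\left(\frac{g'(R^2)}{g(R^2)}\right)^2 R^4\right]},$

(23)  $\sigma_{2g} = -\frac{2\sigma_{1g}(1-\sigma_{1g})}{2 + p(1-\sigma_{1g})}.$

See Tyler (1982).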
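
The remark that (19) and (20) are usually solved iteratively can be illustrated in the simplest case $p = 1$, where $\Lambda$ is a scalar. The sketch below is one possible fixed-point scheme, assumed here for illustration rather than taken from the text: with weights $u_\alpha = -2g'(w_\alpha)/g(w_\alpha)$ evaluated at the current values, it updates $\hat\mu$ as the weighted mean required by (19) and $\hat\Lambda$ as the weighted sum in (20). The names `normal_weight`, `make_t_weight`, and `fit_location_scale` are hypothetical; the $t$-type weight corresponds to $g(w)$ proportional to $(1 + w/m)^{-(m+p)/2}$, and the data are assumed not to be all identical.

```python
def normal_weight(w):
    """-2 g'(w)/g(w) when g(w) = const * exp(-w/2): every weight equals 1."""
    return 1.0


def make_t_weight(df, p=1):
    """-2 g'(w)/g(w) when g(w) is proportional to (1 + w/df)^(-(df + p)/2)."""
    def weight(w):
        return (df + p) / (df + w)
    return weight


def fit_location_scale(xs, weight, max_iter=500, tol=1e-12):
    """Fixed-point iteration for the scalar (p = 1) versions of (19) and (20)."""
    n = len(xs)
    mu = sum(xs) / n
    lam = sum((x - mu) ** 2 for x in xs) / n
    for _ in range(max_iter):
        u = [weight((x - mu) ** 2 / lam) for x in xs]
        new_mu = sum(ui * x for ui, x in zip(u, xs)) / sum(u)            # (19)
        new_lam = sum(ui * (x - new_mu) ** 2 for ui, x in zip(u, xs)) / n  # (20)
        converged = abs(new_mu - mu) < tol and abs(new_lam - lam) < tol
        mu, lam = new_mu, new_lam
        if converged:
            break
    return mu, lam
```

With `normal_weight` the iteration stops at the sample mean and $(1/N)\sum(x_\alpha - \bar{x})^2$, matching the statement that the weights are $1/N$ in the normal case. With a heavy-tailed $g$, observations far from the bulk receive small weights.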


3.6.4. Elliptically Contoured Matrix Distributions

Let

(24)  $Y = \begin{pmatrix} y_1' \\ \vdots \\ y_N' \end{pmatrix}$

be an $N \times p$ random matrix with density $g(Y'Y) = g(\sum_{\alpha=1}^{N} y_\alpha y_\alpha')$. Note that
the density $g(Y'Y)$ is invariant with respect to orthogonal transformations
$Y^* = O_N Y$. Such densities are known as left spherical matrix densities. An
example is the density of $N$ observations from $N(0, I_p)$,

(25)  $g(Y'Y) = \frac{1}{(2\pi)^{\frac{1}{2}Np}}\,e^{-\frac{1}{2}\operatorname{tr} Y'Y}.$

In this example $Y$ is also right spherical: $YO_p \stackrel{d}{=} Y$. When $Y$ is both left
spherical and right spherical, it is known as spherical. Further, if $Y$ has the
density (25), vec $Y$ is spherical; in general if $Y$ has a density, the density is of
the form

(26)  $g(\operatorname{tr} Y'Y) = g\left(\sum_{\alpha=1}^{N}\sum_{i=1}^{p} y_{i\alpha}^2\right) = g(\operatorname{tr} YY')$
      $= g[(\operatorname{vec} Y)'\operatorname{vec} Y] = g[(\operatorname{vec} Y')'\operatorname{vec} Y'].$
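We call this model vector-spherical. Define

(27)  $X = YC' + \varepsilon_N\mu',$

where $C'\Lambda^{-1}C = I$, and $\varepsilon_N' = (1,\ldots,1)$. Since (27) is equivalent to
$Y = (X - \varepsilon_N\mu')(C')^{-1}$ and $(C')^{-1}C^{-1} = \Lambda^{-1}$, the matrix $X$ has the density

(28)  $|\Lambda|^{-\frac{1}{2}N}\,g[\operatorname{tr}(X - \varepsilon_N\mu')\Lambda^{-1}(X - \varepsilon_N\mu')']$
      $= |\Lambda|^{-\frac{1}{2}N}\,g\left[\sum_{\alpha=1}^{N}(x_\alpha-\mu)'\Lambda^{-1}(x_\alpha-\mu)\right].$

From (26) we deduce that vec $Y$ has the representation

(29)  $\operatorname{vec} Y \stackrel{d}{=} R\operatorname{vec} U,$

where $w = R^2$ has the density

(30)  $\frac{\pi^{\frac{1}{2}Np}}{\Gamma(\frac{1}{2}Np)}\,w^{\frac{1}{2}Np-1}\,g(w),$

vec $U$ has the uniform distribution on $\sum_{\alpha=1}^{N}\sum_{i=1}^{p} u_{i\alpha}^2 = 1$, and $R$ and vec $U$ are
independent. The covariance matrix of vec $Y$ is

(31)  $\mathcal{C}(\operatorname{vec} Y) = \mathcal{E}\operatorname{vec} Y(\operatorname{vec} Y)' = \frac{\mathcal{E}R^2}{Np}\,I_{Np} = \frac{\mathcal{E}R^2}{Np}\,I_p \otimes I_N.$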



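Since vec $FGH = (H' \otimes F)$ vec $G$ for any conformable matrices $F$, $G$, and $H$,
we can write (27) as

(32)  $\operatorname{vec} X = (C \otimes I_N)\operatorname{vec} Y + \mu \otimes \varepsilon_N.$

Thus

(33)  $\mathcal{E}\operatorname{vec} X = \mu \otimes \varepsilon_N,$

(34)  $\mathcal{C}(\operatorname{vec} X) = (C \otimes I_N)\,\mathcal{C}(\operatorname{vec} Y)\,(C' \otimes I_N) = \frac{\mathcal{E}R^2}{Np}\,\Lambda \otimes I_N,$

(35)  $\mathcal{E}(\text{row of } X) = \mu',$

(36)  $\mathcal{C}[(\text{row of } X)'] = \frac{\mathcal{E}R^2}{Np}\,\Lambda.$

The rows of $X$ are uncorrelated (though not necessarily independent). From
(32) we obtain

(37)  $\operatorname{vec} X \stackrel{d}{=} R(C \otimes I_N)\operatorname{vec} U + \mu \otimes \varepsilon_N,$

(38)  $X \stackrel{d}{=} RUC' + \varepsilon_N\mu'.$

Since $X - \varepsilon_N\mu' = (X - \varepsilon_N\bar{x}') + \varepsilon_N(\bar{x}-\mu)'$ and $\varepsilon_N'(X - \varepsilon_N\bar{x}') = 0$, we can
write the density of $X$ as

(39)  $|\Lambda|^{-\frac{1}{2}N}\,g[\operatorname{tr}\Lambda^{-1}(X - \varepsilon_N\bar{x}')'(X - \varepsilon_N\bar{x}') + N(\bar{x}-\mu)'\Lambda^{-1}(\bar{x}-\mu)],$

where $\bar{x} = (1/N)X'\varepsilon_N$. This shows that a sufficient set of statistics for $\mu$ and
$\Lambda$ is $\bar{x}$ and $nS = (X - \varepsilon_N\bar{x}')'(X - \varepsilon_N\bar{x}')$, as for the normal distribution. The
maximum likelihood estimators can be derived from the following theorem,
which will be used later for other models.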


Theorem 3.6.3. Suppose the $m$-component vector $Z$ has the density
$|\Phi|^{-\frac{1}{2}}h[(z-\nu)'\Phi^{-1}(z-\nu)]$, where $w^{\frac{1}{2}m}h(w)$ has a finite positive maximum at
$w_h$ and $\Phi$ is a positive definite matrix. Let $\Omega$ be a set in the space of $(\nu, \Phi)$
such that if $(\nu, \Phi) \in \Omega$ then $(\nu, c\Phi) \in \Omega$ for all $c > 0$. Suppose that on the
basis of an observation $z$ when $h(w) = \text{const}\; e^{-\frac{1}{2}w}$ (i.e., $Z$ has a normal
distribution) the maximum likelihood estimator $(\tilde\nu, \tilde\Phi) \in \Omega$ exists and is unique
with $\tilde\Phi$ positive definite with probability 1. Then the maximum likelihood
estimator of $(\nu, \Phi)$ for arbitrary $h(\cdot)$ is

(40)  $\hat\nu = \tilde\nu, \qquad \hat\Phi = \frac{m}{w_h}\,\tilde\Phi,$

and the maximum of the likelihood is $|\hat\Phi|^{-\frac{1}{2}}h(w_h)$ [Anderson, Fang, and Hsu
(1986)].


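Proof. Let $\Psi = |\Phi|^{-1/m}\Phi$ and

(41)  $d = (z-\nu)'\Phi^{-1}(z-\nu) = \frac{(z-\nu)'\Psi^{-1}(z-\nu)}{|\Phi|^{1/m}}.$

Then $(\nu, \Psi) \in \Omega$ and $|\Psi| = 1$. The likelihood is

(42)  $[(z-\nu)'\Psi^{-1}(z-\nu)]^{-\frac{1}{2}m}\,d^{\frac{1}{2}m}\,h(d).$

Under normality $h(d) = (2\pi)^{-\frac{1}{2}m}e^{-\frac{1}{2}d}$, and the maximum of (42) is attained
at $\nu = \tilde\nu$, $\Psi = \tilde\Psi = |\tilde\Phi|^{-1/m}\tilde\Phi$, and $d = m$. For arbitrary $h(\cdot)$ the maximum
of (42) is attained at $\hat\nu = \tilde\nu$, $\hat\Psi = \tilde\Psi$, and $\hat{d} = w_h$. Then the maximum likelihood
estimator of $\Phi$ is

(43)  $\hat\Phi = |\hat\Phi|^{1/m}\,\hat\Psi = \frac{(z-\tilde\nu)'\tilde\Psi^{-1}(z-\tilde\nu)}{w_h}\,\tilde\Psi.$

Then (40) follows from (43) by use of (41).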

Theorem 3.6.4. Let $X$ $(N \times p)$ have the density (28), where $w^{\frac{1}{2}Np}g(w)$ has
a finite positive maximum at $w_g$. Then the maximum likelihood estimators of
$\mu$ and $\Lambda$ are

(44)  $\hat\mu = \bar{x}, \qquad \hat\Lambda = \frac{p}{w_g}\,A,$

where $A = \sum_{\alpha=1}^{N}(x_\alpha-\bar{x})(x_\alpha-\bar{x})'$.


Corollary 3.6.1. Let $X$ $(N \times p)$ have the density (28). Then the maximum
likelihood estimators of $\mu$, $(\lambda_{11},\ldots,\lambda_{pp})$, and $\rho_{ij}$, $i,j = 1,\ldots,p$, are $\bar{x}$,
$(p/w_g)(a_{11},\ldots,a_{pp})$, and $a_{ij}/\sqrt{a_{ii}a_{jj}}$, $i,j = 1,\ldots,p$.

Proof. Corollary 3.6.1 follows from Theorem 3.6.3 and Corollary 3.2.1.


Theorem 3.6.5. Let $f(X)$ be a vector-valued function of $X$ $(N \times p)$ such
that

(45)  $f(X + \varepsilon_N\nu') = f(X)$

for all $\nu$ and

(46)  $f(cX) = f(X)$

for all $c$. Then the distribution of $f(X)$ where $X$ has an arbitrary density (28) is
the same as its distribution where $X$ has the normal density (28).


Proof. Substitution of the representation (27) into $f(X)$ gives

(47)  $f(X) = f(YC' + \varepsilon_N\mu') = f(YC')$

by (45). Let $f(X) = h(\operatorname{vec} X)$. Then by (46), $h(c\operatorname{vec} X) = h(\operatorname{vec} X)$ and

(48)  $f(YC') = h[(C \otimes I_N)\operatorname{vec} Y] = h[R(C \otimes I_N)\operatorname{vec} U]$
      $= h[(C \otimes I_N)\operatorname{vec} U].$


Any statistic satisfying (45) and (46) has the same distribution for all g(-). 
Hence, if its distribution is known for the normal case, the distribution is 
valid for all elliptically contoured distributions. 

Any function of the sufficient set of statistics that is translation-invariant,
that is, that satisfies (45), is a function of $S$. Thus inference concerning $\Sigma$ can
be based on $S$.


Corollary 3.6.2. Let $f(X)$ be a vector-valued function of $X$ $(N \times p)$ such
that (46) holds for all $c$. Then the distribution of $f(X)$ where $X$ has arbitrary
density (28) with $\mu = 0$ is the same as its distribution where $X$ has normal density
(28) with $\mu = 0$.


Fang and Zhang (1990) give this corollary as Theorem 2.5.8. 


PROBLEMS 


3.1. (Sec. 3.2) Find $\hat\mu$, $\hat\Sigma$, and $(\hat\rho_{ij})$ for the data given in Table 3.3, taken from
Frets (1921).


3.2. (Sec. 3.2) Verify the numerical results of (21).


3.3. (Sec. 3.2) Compute $\hat\mu$, $\hat\Sigma$, $S$, and $\hat\rho$ for the following pairs of observations:
(34, 55), (12, 29), (33, 75), (44, 89), (89, 62), (59, 69), (50, 41), (88, 67). Plot the
observations.


3.4. (Sec. 3.2) Use the facts that $|C^*| = \prod\lambda_i$, $\operatorname{tr} C^* = \sum\lambda_i$, and $C^* = I$ if
$\lambda_1 = \cdots = \lambda_p = 1$, where $\lambda_1,\ldots,\lambda_p$ are the characteristic roots of $C^*$, to prove Lemma
3.2.2. [Hint: Use $f$ as given in (12).]



Table 3.3†. Head Lengths and Breadths of Brothers


Head Length,   Head Breadth,   Head Length,   Head Breadth,
First Son,     First Son,      Second Son,    Second Son,
$x_1$          $x_2$           $x_3$          $x_4$
194            155             179            145
195            149             201            152
181            148             185            149
183            153             188            149
176            144             171            142
208            157             192            152
189            150             190            149
197            159             189            152
188            152             197            159
192            150             187            151
179            158             186            148
183            147             174            147
174            150             185            152
190            159             195            157
188            151             187            158
163            137             161            130
195            155             183            158
186            153             173            148
181            145             182            146
175            140             165            137
192            154             185            152
174            143             178            147
176            139             176            143
197            167             200            158
190            163             187            150


†These data, used in examples in the first edition of this book, came from Rao
(1952), p. 245. Izenman (1980) has indicated some entries were apparently
incorrectly copied from Frets (1921) and corrected them (p. 579).


3.5. (Sec. 3.2) Let $x_1$ be the body weight (in kilograms) of a cat and $x_2$ the heart
weight (in grams). [Data from Fisher (1947b).]


(a) In a sample of 47 female cats the relevant data are


sx - {1109 Sx xi =| 265-13 1029.62 
a%a™\ 1029.62 4064.71 


Find $\hat\mu$, $\hat\Sigma$, $S$, and $\hat\rho$.




Table 3.4. Four Measurements on Three Species of Iris (in centimeters) 


{ris setosa Tris versicolor 


Iris virginica 
Sepal Sepal Petal Petal | Sepal Sepal Petal Petal 
length width length width | length width length width 


Sepal Sepal Petal Petal 
length width Jength width 


49 3.1 1.5 0.1 5.2 2.7 3.9 1.4 72 3.6 6.1 2.5 
5.4 37 1.5 0,2 5.0 2.0 3.5 1.0 65 3.2 5.1 2.0 
48 3.4 1.6 0,2 5.9 3.0 42 15 6.4 2.7 5.3 1.9 
4.8 3 1.4 0.1 6.0 2.2 4.0 1.0 6.8 3.0 5.5 2.1 


5,7 4.4 1.5 0.4 6.7 31 4.4 1.4 6.4 3.2 53 2.3 
54 3.9 1.3 0.4 5.6 3.0 45 15 6.5 3.0 5.5 1.8 
5.1 3,5 1.4 0.3 5.8 2.7 4,1 1.0 7.7 3.8 6.7 2.2 


4.7 2 1.6 0,2 5.7 2.6 3.5 1,0 7.2 3.0 5.8 1.6 
4.8 3,1 1.6 0.2 5.5 2.4 38 1. 7.4 2.8 6. | 1.9 
5.4 3.4 1.5 0,4 5.5 2.4 37 1.0 7.9 3.8 6.4 2.0 
5.2 41 1.5 0.1 5.8 2,7 3.9 1,2 6.4 2.8 5.6 2.2 
5.5 4,2 1.4 0.2 6.0 Red: 5.1 1.6 6.3 2.8 5.1 1.5 
49 3.1 1.5 0.2 5.4 3.0 45 15 6.1 2.6 5.6 1.4 




Table 3.4, (Continued) 


Sepal Sepal Petal Petal 
length width 


3.6. 


3.7. 


3.8. 


3.9. 


Iris setosa Tris versicolor Tris virginica 
Sepal Sepal Petal Petal 
length width length width 


6.7 3.1 5.6 2.4 
6.9 3.1 5.1 2.3 
5.8 2.7 5.1 1.9 


Sepal Sepal Petal Petal 
length width length width 


length width 


3.8 19 04 | 5.6 2.7 42 130, 67 3.3 37 2.5 


3.0 1.4 0.3 5.7 3.0 4.2 1.2 6.7 3.0 5.2 2.3 
3.8 1.6 0.2 5.7 29 4.2 1.3 6.3 2.5 5.) 1.9 
3.2 1.4 0.2 6.2 2.9 43 1,3 6.5 3.0 5.2 2.0 
3.7 1.5 0.2 5.1 2.5 3.0 1,1 6.2 3.4 5.4 2.3 
3.3 1.4 0.2 5.7 2.8 4.1 1.3 5.9 3.0 5.1 1.8 


281.3 , _ { 836,75 3275.55 
uxa | al 2X_% Cae 13056.17} 


Find $\hat\mu$, $\hat\Sigma$, $S$, and $\hat\rho$.


3.6. Find $\hat\mu$, $\hat\Sigma$, and $(\hat\rho_{ij})$ for Iris setosa from Table 3.4, taken from Edgar Anderson's
famous iris data [Fisher (1936)].


3.7. (Sec. 3.2) Invariance of the sample correlation coefficient. Prove that $r_{12}$ is an
invariant characteristic of the sufficient statistics $\bar{x}$ and $S$ of a bivariate sample
under location and scale transformations ($x_{i\alpha}^* = b_i x_{i\alpha} + c_i$, $b_i > 0$, $i = 1,2$,
$\alpha = 1,\ldots,N$) and that every function of $\bar{x}$ and $S$ that is invariant is a function of
$r_{12}$. [Hint: See Theorem 2.3.2.]


3.8. (Sec. 3.2) Prove Lemma 3.2.2 by induction. [Hint: Let $H_1 = h_{11}$,


and use Problem 2.36.]


3.9. (Sec. 3.2) Show that

$\frac{1}{N(N-1)}\sum_{\alpha<\beta}(x_\alpha-x_\beta)(x_\alpha-x_\beta)' = \frac{1}{N-1}\sum_{\alpha=1}^{N}(x_\alpha-\bar{x})(x_\alpha-\bar{x})'.$


(Note: When $p = 1$, the left-hand side is the average squared differences of the
observations.)


3.10. (Sec. 3.2) Estimation of $\Sigma$ when $\mu$ is known. Show that if $x_1,\ldots,x_N$ constitute
a sample from $N(\mu, \Sigma)$ and $\mu$ is known, then $(1/N)\sum_{\alpha=1}^{N}(x_\alpha-\mu)(x_\alpha-\mu)'$ is
the maximum likelihood estimator of $\Sigma$.


3.11. (Sec. 3.2) Estimation of parameters of a complex normal distribution. Let
$z_1,\ldots,z_N$ be $N$ observations from the complex normal distribution with mean
$\theta$ and covariance matrix $P$. (See Problem 2.64.)

(a) Show that the maximum likelihood estimators of $\theta$ and $P$ are

$\hat\theta = \bar{z} = \frac{1}{N}\sum_{\alpha=1}^{N} z_\alpha, \qquad \hat{P} = \frac{1}{N}\sum_{\alpha=1}^{N}(z_\alpha-\bar{z})(z_\alpha-\bar{z})^*.$

(b) Show that $\bar{z}$ has the complex normal distribution with mean $\theta$ and covariance
matrix $(1/N)P$.

(c) Show that $\bar{z}$ and $\hat{P}$ are independently distributed and that $N\hat{P}$ has the
distribution of $\sum_{\alpha=1}^{n} W_\alpha W_\alpha^*$, where $W_1,\ldots,W_n$ are independently distributed,
each according to the complex normal distribution with mean 0 and covariance
matrix $P$, and $n = N-1$.


3.12. (Sec. 3.2) Prove Lemma 3.2.2 by using Lemma 3.2.3 and showing $N\log|C| - \operatorname{tr} CD$
has a maximum at $C = ND^{-1}$ by setting the derivatives of this function
with respect to the elements of $C = \Sigma^{-1}$ equal to 0. Show that the function of $C$
tends to $-\infty$ as $C$ tends to a singular matrix or as one or more elements of $C$
tend to $\infty$ and/or $-\infty$ (nondiagonal elements); for this latter, the equivalent
of (13) can be used.


3.13. (Sec. 3.3) Let $X_\alpha$ be distributed according to $N(\gamma c_\alpha, \Sigma)$, $\alpha = 1,\ldots,N$, where
$\sum c_\alpha^2 > 0$. Show that the distribution of $g = (1/\sum c_\alpha^2)\sum c_\alpha X_\alpha$ is $N[\gamma, (1/\sum c_\alpha^2)\Sigma]$.
Show that $E = \sum_\alpha (X_\alpha - gc_\alpha)(X_\alpha - gc_\alpha)'$ is independently distributed as
$\sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha'$, where $Z_1,\ldots,Z_{N-1}$ are independent, each with distribution $N(0, \Sigma)$.
[Hint: Let $Z_\alpha = \sum_\beta b_{\alpha\beta}X_\beta$, where $b_{N\beta} = c_\beta/\sqrt{\sum c_\alpha^2}$ and $B$ is orthogonal.]


3.14. (Sec. 3.3) Prove that the power of the test in (19) is a function only of $p$ and
$[N_1N_2/(N_1+N_2)](\mu^{(1)}-\mu^{(2)})'\Sigma^{-1}(\mu^{(1)}-\mu^{(2)})$, given $\alpha$.


3.15. (Sec. 3.3) Efficiency of the mean. Prove that $\bar{x}$ is efficient for estimating $\mu$.


3.16. (Sec. 3.3) Prove that $\bar{x}$ and $S$ have efficiency $[(N-1)/N]^{p(p+1)/2}$ for estimating
$\mu$ and $\Sigma$.


3.17. (Sec. 3.2) Prove that $\Pr\{|A| = 0\} = 0$ for $A$ defined by (4) when $N > p$. [Hint:
Argue that if $Z_p^* = (Z_1,\ldots,Z_p)$, then $|Z_p^*| \ne 0$ implies
$A = Z_p^*Z_p^{*\prime} + \sum_{\alpha=p+1}^{N-1} Z_\alpha Z_\alpha'$ is positive definite. Prove
$\Pr\{|Z_j^*| = Z_{jj}|Z_{j-1}^*| + \sum_{i=1}^{j-1} Z_{ij}\operatorname{cof}(Z_{ij}) = 0\} = 0$ by induction, $j = 2,\ldots,p$.]


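3.18. (Sec. 3.4) Prove

$I - \Phi(\Phi+\Sigma)^{-1} = \Sigma(\Phi+\Sigma)^{-1},$

$\Phi - \Phi(\Phi+\Sigma)^{-1}\Phi = (\Phi^{-1}+\Sigma^{-1})^{-1}.$


3.19. (Sec. 3.4) Prove $(1/N)\sum_{\alpha=1}^{N}(x_\alpha-\mu)(x_\alpha-\mu)'$ is an unbiased estimator of $\Sigma$
when $\mu$ is known.


3.20. (Sec. 3.4) Show that

$\Phi\left(\Phi + \frac{1}{N}\Sigma\right)^{-1}\bar{x} + \frac{1}{N}\Sigma\left(\Phi + \frac{1}{N}\Sigma\right)^{-1}\nu = (\Phi^{-1} + N\Sigma^{-1})^{-1}(N\Sigma^{-1}\bar{x} + \Phi^{-1}\nu).$


3.21. (Sec. 3.5) Demonstrate Lemma 3.5.1 using integration by parts.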


3.22. (Sec. 3.5) Show that 


Wee aya 
e720 @) dy, 


a = 1 ~Ha-0¥ | aa ss 
[PPro ange toe evar f 


~toy-ay 
e774) 


avdy =f" lf dlgee te cas 


3.23. Let $Z(k) = (Z_{ij}(k))$, where $i = 1,\ldots,p$, $j = 1,\ldots,q$ and $k = 1,2,\ldots$, be a
sequence of random matrices. Let one norm of a matrix $A$ be
$N_1(A) = \max_{i,j}\operatorname{mod}(a_{ij})$, and another be $N_2(A) = \sum_{i,j} a_{ij}^2 = \operatorname{tr} AA'$. Some alternative
ways of defining stochastic convergence of $Z(k)$ to $B$ $(p \times q)$ are

(a) $N_1(Z(k) - B)$ converges stochastically to 0,

(b) $N_2(Z(k) - B)$ converges stochastically to 0, and

(c) $Z_{ij}(k) - b_{ij}$ converges stochastically to 0, $i = 1,\ldots,p$, $j = 1,\ldots,q$.

Prove that these three definitions are equivalent. Note that the definition of
$X(k)$ converging stochastically to $a$ is that for every arbitrary positive $\delta$ and $\varepsilon$,
we can find $K$ large enough so that for $k > K$

$\Pr\{|X(k) - a| < \delta\} > 1 - \varepsilon.$


3.24. (Sec. 3.2) Covariance matrices with linear structure [Anderson (1969)]. Let 


q 
(i) X= }' a,G,. 


e ou 


ESTIMATION OF THE MEAN VECTOR AND THE COVARIANCE MATRIX 


where Go,...,G, are given symmetric matrices such that there exists at least 
one (q+ 1)-tuplet oo, 0,,...,0, such that (i) is positive definite. Show that the 
likelihood equations based on N observations are 


(ii) — Le 1G, + Ft Ad-1G,3-! =0, B= OL erg 


Show that an iterative (scoring) method can be based on 
q 5 _ ; 1 
(iii) tr £1.G,2/2,6,6. = pt 276,274, 4, 8=0,1,.05g; 
A : 


h= 


where ©,_, 54.9 6"%G,. 


"ae ll MM a z=  - 


CHAPTER 4 


The Distributions and Uses of 
Sample Correlation Coefficients 


4.1. INTRODUCTION 


In Chapter 2, in which the multivariate normal distribution was introduced, it
was shown that a measure of dependence between two normal variates is the
correlation coefficient $\rho_{ij} = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}}$. In a conditional distribution of
$X_1,\ldots,X_q$ given $X_{q+1} = x_{q+1},\ldots,X_p = x_p$, the partial correlation $\rho_{ij\cdot q+1,\ldots,p}$
measures the dependence between $X_i$ and $X_j$. The third kind of correlation
discussed was the multiple correlation which measures the relationship
between one variate and a set of others. In this chapter we treat the sample
equivalents of these quantities; they are point estimates of the population
quantities. The distributions of the sample correlations are found. Tests of
hypotheses and confidence intervals are developed.

In the cases of joint normal distributions these correlation coefficients are
the natural measures of dependence. In the population they are the only
parameters except for location (means) and scale (standard deviations)
parameters. In the sample the correlation coefficients are derived as the
reasonable estimates of the population correlations. Since the sample means
and standard deviations are location and scale estimates, the sample
correlations (that is, the standardized sample second moments) give all possible
information about the population correlations. The sample correlations are
the functions of the sufficient statistics that are invariant with respect to
location and scale transformations; the population correlations are the
functions of the parameters that are invariant with respect to these
transformations.


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc.


116 SAMPLE CORRELATION COEFFICIENTS 


In regression theory or least squares, one variable is considered random or 
dependent, and the others fixed or independent. In correlation theory we 
consider several variables as random and treat them symmetrically. If we 
start with a joint normal distribution and hold all variables fixed except one, 
we obtain the least squares model because the expected value of tlle random 
variable in the conditional distribution is a linear function of the variables 
held fixed. The sample regression coefficients obtained in least squares are 
functions of the sample variances and correlations, 

In testing independence we shall see that we arrive at the same tests in 
either casc (ie., in the joint normal distribution or in the conditional 
distribution of least squares). The probability theory under the null hypothe- 
sis is the same. The distribution of the test criterion when the null hypothesis 
is not true differs in the two cases. If all variables may be considered random, 
one uses correlation theory as given here; if only one variable is random, 
one uses least squares theory (which is considered in some generality in 
Chapter 8). 

In Section 4.2 we derive the distribution of the sample correlation coeffi- 
cient, first when the corresponding population correlation coefficient is 0 (the 
two normal variables being independent) and then for any value of the 
population coefficient. The Fisher z-transform yields a useful approximate 
normal distribution. Exact and approximate confidence intervals are devel- 
oped. In Section 4.3 we carry out the same program for partial correlations, 
that is, correlations in conditional normal distributions. In Section 4.4 the 
distributions and other properties of the sample multiple correlation
coefficient are studied. In Section 4.5 the asymptotic distributions of these
correlations are derived for elliptically contoured distributions. A stochastic
representation for a class of such distributions is found. 


4.2. CORRELATION COEFFICIENT OF A BIVARIATE SAMPLE 


4.2.1. The Distribution When the Population Correlation Coefficient Is Zero; 
Tests of the Hypothesis of Lack of Correlation 


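In Section 3.2 it was shown that if one has a sample (of p-component vectors)
$x_1, \ldots, x_N$ from a normal distribution, the maximum likelihood estimator of
the correlation between $X_i$ and $X_j$ (two components of the random vector
X) is

(1)  $$r_{ij} = \frac{\sum_{\alpha=1}^{N} (x_{i\alpha} - \bar{x}_i)(x_{j\alpha} - \bar{x}_j)}{\sqrt{\sum_{\alpha=1}^{N} (x_{i\alpha} - \bar{x}_i)^2}\,\sqrt{\sum_{\alpha=1}^{N} (x_{j\alpha} - \bar{x}_j)^2}},$$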




where $x_{i\alpha}$ is the ith component of $x_\alpha$ and

(2)  $$\bar{x}_i = \frac{1}{N} \sum_{\alpha=1}^{N} x_{i\alpha}.$$


In this section we shall find the distribution of $r_{ij}$ when the population
correlation between $X_i$ and $X_j$ is zero, and we shall see how to use the
sample correlation coefficient to test the hypothesis that the population
coefficient is zero.
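
As a hedged illustration of definitions (1) and (2), the following Python sketch computes the sample correlation of two components directly from the defining sums. The function name `sample_correlation` and the data are assumptions introduced here for illustration; they do not come from the text.

```python
import math

def sample_correlation(x, y):
    """Sample correlation of paired observations, following (1) and (2)."""
    n_obs = len(x)
    if n_obs != len(y) or n_obs < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    x_bar = sum(x) / n_obs  # equation (2) for the first component
    y_bar = sum(y) / n_obs  # equation (2) for the second component
    a_xy = sum((xa - x_bar) * (ya - y_bar) for xa, ya in zip(x, y))
    a_xx = sum((xa - x_bar) ** 2 for xa in x)
    a_yy = sum((ya - y_bar) ** 2 for ya in y)
    return a_xy / (math.sqrt(a_xx) * math.sqrt(a_yy))

# Made-up illustrative data (not from the text).
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
r = sample_correlation(x, y)

# A change of location and scale in one variable leaves r unchanged.
r_rescaled = sample_correlation([2.0 * a + 5.0 for a in x], y)
```

For these four points the sums are 4, 5, and 5, so r is 4/5; the rescaled sample gives the same value, in line with the invariance noted in Section 4.1.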

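For convenience we shall treat $r_{12}$; the same theory holds for each $r_{ij}$.
Since $r_{12}$ depends only on the first two coordinates of each $x_\alpha$, to find the
distribution of $r_{12}$ we need only consider the joint distribution of $(x_{11}, x_{21})$,
$(x_{12}, x_{22}), \ldots, (x_{1N}, x_{2N})$. We can reformulate the problems to be considered
here, therefore, in terms of a bivariate normal distribution. Let $x_1^*, \ldots, x_N^*$ be
observation vectors from

(3)  $$N\left[\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho \\ \sigma_2\sigma_1\rho & \sigma_2^2 \end{pmatrix}\right].$$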


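We shall consider

(4)  $$r = \frac{a_{12}}{\sqrt{a_{11}}\,\sqrt{a_{22}}},$$

where

(5)  $$a_{ij} = \sum_{\alpha=1}^{N} (x_{i\alpha} - \bar{x}_i)(x_{j\alpha} - \bar{x}_j), \qquad i, j = 1, 2,$$

and $\bar{x}_i$ is defined by (2), $x_{i\alpha}$ being the ith component of $x_\alpha^*$.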
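From Section 3.3 we see that $a_{11}$, $a_{12}$, and $a_{22}$ are distributed like

(6)  $$a_{ij} = \sum_{\alpha=1}^{n} z_{i\alpha} z_{j\alpha}, \qquad i, j = 1, 2,$$

where $n = N - 1$, $(z_{1\alpha}, z_{2\alpha})$ is distributed according to

(7)  $$N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho \\ \sigma_2\sigma_1\rho & \sigma_2^2 \end{pmatrix}\right],$$

and the pairs $(z_{11}, z_{21}), \ldots, (z_{1n}, z_{2n})$ are independently distributed.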




Figure 4.1 


Define the n-component vector $v_i = (z_{i1}, \ldots, z_{in})'$, i = 1, 2. These two
vectors can be represented in an n-dimensional space; see Figure 4.1. The
correlation coefficient is the cosine of the angle, say $\theta$, between $v_1$ and $v_2$.
(See Section 3.2.) To find the distribution of $\cos\theta$ we shall first find the
distribution of $\cot\theta$. As shown in Section 3.2, if we let $b = v_2'v_1 / v_1'v_1$, then
$v_2 - bv_1$ is orthogonal to $v_1$ and

(8)  $$\cot\theta = \frac{b\,\|v_1\|}{\|v_2 - bv_1\|}.$$

If $v_1$ is fixed, we can rotate coordinate axes so that the first coordinate axis
lies along $v_1$. Then $bv_1$ has only the first coordinate different from zero, and
$v_2 - bv_1$ has this first coordinate equal to zero. We shall show that $\cot\theta$ is
proportional to a t-variable when $\rho = 0$.

We use the following lemma. 


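Lemma 4.2.1. If $Y_1, \ldots, Y_n$ are independently distributed, if $Y_\alpha = (Y_\alpha^{(1)\prime}, Y_\alpha^{(2)\prime})'$
has the density $f(y_\alpha)$, and if the conditional density of $Y_\alpha^{(2)}$ given $Y_\alpha^{(1)} = y_\alpha^{(1)}$ is
$f(y_\alpha^{(2)} \mid y_\alpha^{(1)})$, $\alpha = 1, \ldots, n$, then in the conditional distribution of $Y_1^{(2)}, \ldots, Y_n^{(2)}$
given $Y_1^{(1)} = y_1^{(1)}, \ldots, Y_n^{(1)} = y_n^{(1)}$, the random vectors $Y_1^{(2)}, \ldots, Y_n^{(2)}$ are independent
and the density of $Y_\alpha^{(2)}$ is $f(y_\alpha^{(2)} \mid y_\alpha^{(1)})$, $\alpha = 1, \ldots, n$.


Proof. The marginal density of $Y_1^{(1)}, \ldots, Y_n^{(1)}$ is $\prod_{\alpha=1}^{n} f_1(y_\alpha^{(1)})$, where $f_1(y_\alpha^{(1)})$
is the marginal density of $Y_\alpha^{(1)}$, and the conditional density of $Y_1^{(2)}, \ldots, Y_n^{(2)}$
given $Y_1^{(1)} = y_1^{(1)}, \ldots, Y_n^{(1)} = y_n^{(1)}$ is

(9)  $$\frac{\prod_{\alpha=1}^{n} f(y_\alpha)}{\prod_{\alpha=1}^{n} f_1(y_\alpha^{(1)})} = \prod_{\alpha=1}^{n} \frac{f(y_\alpha)}{f_1(y_\alpha^{(1)})} = \prod_{\alpha=1}^{n} f(y_\alpha^{(2)} \mid y_\alpha^{(1)}). \qquad \blacksquare$$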




Write $V_i = (Z_{i1}, \ldots, Z_{in})'$, i = 1, 2, to denote random vectors. The
conditional distribution of $Z_{2\alpha}$ given $Z_{1\alpha} = z_{1\alpha}$ is $N(\beta z_{1\alpha}, \sigma^2)$, where $\beta = \rho\sigma_2/\sigma_1$
and $\sigma^2 = \sigma_2^2(1 - \rho^2)$. (See Section 2.5.) The density of $V_2$ given $V_1 = v_1$ is
$N(\beta v_1, \sigma^2 I)$ since the $Z_{2\alpha}$ are independent. Let $b = V_2'v_1 / v_1'v_1$ $(= a_{21}/a_{11})$,
so that $bv_1'(V_2 - bv_1) = 0$, and let $U = (V_2 - bv_1)'(V_2 - bv_1) = V_2'V_2 - b^2 v_1'v_1$
$(= a_{22} - a_{12}^2/a_{11})$. Then $\cot\theta = b\sqrt{a_{11}/U}$. The rotation of coordinate axes
involves choosing an $n \times n$ orthogonal matrix C with first row $(1/c)v_1'$, where
$c^2 = v_1'v_1$.

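We now apply Theorem 3.3.1 with $X_\alpha = Z_{2\alpha}$. Let $Y_\alpha = \sum_\gamma c_{\alpha\gamma} Z_{2\gamma}$,
$\alpha = 1, \ldots, n$. Then $Y_1, \ldots, Y_n$ are independently normally distributed with
variance $\sigma^2$ and means

(10)  $$\mathcal{E}Y_1 = \sum_{\gamma=1}^{n} c_{1\gamma}\beta z_{1\gamma} = \frac{\beta}{c}\sum_{\gamma=1}^{n} z_{1\gamma}^2 = \beta c,$$

(11)  $$\mathcal{E}Y_\alpha = \sum_{\gamma=1}^{n} c_{\alpha\gamma}\beta z_{1\gamma} = \beta c \sum_{\gamma=1}^{n} c_{\alpha\gamma} c_{1\gamma} = 0, \qquad \alpha \neq 1.$$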


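We have $b = \sum_{\alpha=1}^{n} Z_{2\alpha} z_{1\alpha} / \sum_{\alpha=1}^{n} z_{1\alpha}^2 = c\sum_{\alpha=1}^{n} Z_{2\alpha} c_{1\alpha} / c^2 = Y_1/c$ and, from
Lemma 3.3.1,

(12)  $$U = \sum_{\alpha=1}^{n} Z_{2\alpha}^2 - b^2 \sum_{\alpha=1}^{n} z_{1\alpha}^2 = \sum_{\alpha=1}^{n} Y_\alpha^2 - Y_1^2 = \sum_{\alpha=2}^{n} Y_\alpha^2,$$

which is independent of b. Then $U/\sigma^2$ has a $\chi^2$-distribution with n - 1
degrees of freedom.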


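Lemma 4.2.2. If $(Z_{1\alpha}, Z_{2\alpha})$, $\alpha = 1, \ldots, n$, are independent, each pair with
density (7), then the conditional distributions of $b = \sum_{\alpha=1}^{n} Z_{2\alpha}Z_{1\alpha} / \sum_{\alpha=1}^{n} Z_{1\alpha}^2$
and $U/\sigma^2 = \sum_{\alpha=1}^{n} (Z_{2\alpha} - bZ_{1\alpha})^2/\sigma^2$ given $Z_{1\alpha} = z_{1\alpha}$, $\alpha = 1, \ldots, n$, are
$N(\beta, \sigma^2/c^2)$ $(c^2 = \sum_{\alpha=1}^{n} z_{1\alpha}^2)$ and $\chi^2$ with n - 1 degrees of freedom,
respectively; and b and U are independent.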


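If $\rho = 0$, then $\beta = 0$, and b is distributed conditionally according to
$N(0, \sigma^2/c^2)$, and

(13)  $$\frac{cb/\sigma}{\sqrt{(U/\sigma^2)/(n-1)}} = \frac{cb}{\sqrt{U/(n-1)}}$$

has a conditional t-distribution with n - 1 degrees of freedom. (See Problem
4.27.) However, this random variable is

(14)  $$\sqrt{n-1}\,\frac{\sqrt{a_{11}}\,a_{12}/a_{11}}{\sqrt{a_{22} - a_{12}^2/a_{11}}} = \sqrt{n-1}\,\frac{a_{12}/\sqrt{a_{11}a_{22}}}{\sqrt{1 - [a_{12}^2/(a_{11}a_{22})]}} = \sqrt{n-1}\,\frac{r}{\sqrt{1-r^2}}.$$

Thus $\sqrt{n-1}\,r/\sqrt{1-r^2}$ has a conditional t-distribution with n - 1 degrees of
freedom. The density of t is

(15)  $$\frac{\Gamma(\tfrac12 n)}{\sqrt{n-1}\,\Gamma[\tfrac12(n-1)]\sqrt{\pi}}\left(1 + \frac{t^2}{n-1}\right)^{-\frac12 n},$$

and the density of $W = r/\sqrt{1-r^2}$ is

(16)  $$\frac{\Gamma(\tfrac12 n)}{\Gamma[\tfrac12(n-1)]\sqrt{\pi}}\,(1 + w^2)^{-\frac12 n}.$$

Since $w = r(1-r^2)^{-\frac12}$, we have $dw/dr = (1-r^2)^{-\frac32}$. Therefore the density of
r is (replacing n by N - 1)

(17)  $$\frac{\Gamma[\tfrac12(N-1)]}{\Gamma[\tfrac12(N-2)]\sqrt{\pi}}\,(1-r^2)^{\frac12(N-4)}.$$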


It should be noted that (17) is the conditional density of r for $v_1$ fixed.
However, since (17) does not depend on $v_1$, it is also the marginal density
of r.


Theorem 4.2.1. Let $X_1, \ldots, X_N$ be independent, each with distribution
$N(\mu, \Sigma)$. If $\rho_{ij} = 0$, the density of $r_{ij}$ defined by (1) is (17).


From (17) we see that the density is symmetric about the origin. For
N > 4, it has a mode at r = 0 and its order of contact with the r-axis at $\pm 1$ is
$\tfrac12(N-5)$ for N odd and $\tfrac12 N - 3$ for N even. Since the density is even, the
odd moments are zero; in particular, the mean is zero. The even moments
are found by integration (letting $x = r^2$ and using the definition of the beta
function). That $\mathcal{E}r^{2m} = \Gamma[\tfrac12(N-1)]\,\Gamma(m + \tfrac12) / \{\sqrt{\pi}\,\Gamma[\tfrac12(N-1) + m]\}$ and in
particular that the variance is 1/(N - 1) may be verified by the reader.
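
The claims about (17) can be checked numerically. The sketch below is an illustration only: the helper names `null_density` and `simpson` are assumptions introduced here, and the integration uses a composite Simpson rule rather than the beta-function argument of the text. It assumes N > 4 so that the density is finite at the endpoints.

```python
import math

def null_density(r, N):
    """Density (17) of r when the population correlation is zero (N > 4 assumed)."""
    const = math.gamma((N - 1) / 2) / (math.gamma((N - 2) / 2) * math.sqrt(math.pi))
    return const * (1.0 - r * r) ** ((N - 4) / 2)

def simpson(f, a, b, m=2000):
    """Composite Simpson rule on [a, b] with an even number m of subintervals."""
    h = (b - a) / m
    total = f(a) + f(b)
    for k in range(1, m):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

N = 10
total_mass = simpson(lambda r: null_density(r, N), -1.0, 1.0)
mean = simpson(lambda r: r * null_density(r, N), -1.0, 1.0)
variance = simpson(lambda r: r * r * null_density(r, N), -1.0, 1.0)
```

With N = 10 the density should integrate to 1, the mean should vanish by symmetry, and the second moment should agree with 1/(N - 1) = 1/9.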

The most important use of Theorem 4.2.1 is to find significance points for
testing the hypothesis that a pair of variables are not correlated. Consider the
hypothesis

(18)  $$H: \rho_{ij} = 0$$


for some particular pair (i, j). It would seem reasonable to reject this
hypothesis if the corresponding sample correlation coefficient were very 
different from zero. Now how do we decide what we mean by “very 
different”? 

Let us suppose we are interested in testing H against the alternative
hypotheses $\rho_{ij} > 0$. Then we reject H if the sample correlation coefficient $r_{ij}$
is greater than some number $r_0$. The probability of rejecting H when H is
true is

(19)  $$\int_{r_0}^{1} k_N(r)\,dr,$$

where $k_N(r)$ is (17), the density of a correlation coefficient based on N
observations. We choose $r_0$ so (19) is the desired significance level. If we test
H against alternatives $\rho_{ij} < 0$, we reject H when $r_{ij} < -r_0$.

Now suppose we are interested in alternatives $\rho_{ij} \neq 0$; that is, $\rho_{ij}$ may be
either positive or negative. Then we reject the hypothesis H if $r_{ij} > r_1$ or
$r_{ij} < -r_1$. The probability of rejection when H is true is

(20)  $$\int_{-1}^{-r_1} k_N(r)\,dr + \int_{r_1}^{1} k_N(r)\,dr.$$

The number $r_1$ is chosen so that (20) is the desired significance level.

The significance points $r_1$ are given in many books, including Table VI of
Fisher and Yates (1942); the index n in Table VI is equal to our N - 2. Since
$\sqrt{N-2}\,r/\sqrt{1-r^2}$ has the t-distribution with N - 2 degrees of freedom,
t-tables can also be used. Against alternatives $\rho_{ij} \neq 0$, reject H if

(21)  $$\sqrt{N-2}\,\frac{|r_{ij}|}{\sqrt{1 - r_{ij}^2}} > t_{N-2}(\alpha),$$

where $t_{N-2}(\alpha)$ is the two-tailed significance point of the t-statistic with N - 2
degrees of freedom for significance level $\alpha$. Against alternatives $\rho_{ij} > 0$,
reject H if

(22)  $$\sqrt{N-2}\,\frac{r_{ij}}{\sqrt{1 - r_{ij}^2}} > t_{N-2}(2\alpha).$$




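From (13) and (14) we see that $\sqrt{N-2}\,r/\sqrt{1-r^2}$ is the proper statistic for
testing the hypothesis that the regression of $V_2$ on $v_1$ is zero. In terms of the
original observations $\{x_{i\alpha}\}$, we have

(23)  $$\sqrt{N-2}\,\frac{r}{\sqrt{1-r^2}} = \frac{b\sqrt{\sum_{\alpha=1}^{N} (x_{1\alpha} - \bar{x}_1)^2}}{\sqrt{\sum_{\alpha=1}^{N} [x_{2\alpha} - \bar{x}_2 - b(x_{1\alpha} - \bar{x}_1)]^2/(N-2)}},$$

where $b = \sum_{\alpha=1}^{N} (x_{2\alpha} - \bar{x}_2)(x_{1\alpha} - \bar{x}_1) / \sum_{\alpha=1}^{N} (x_{1\alpha} - \bar{x}_1)^2$ is the least squares
regression coefficient of $x_{2\alpha}$ on $x_{1\alpha}$. It is seen that the test of $\rho_{12} = 0$ is
equivalent to the test that the regression of $X_2$ on $x_1$ is zero (i.e., that
$\rho_{12}\sigma_2/\sigma_1 = 0$).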

To illustrate this procedure we consider the example given in Section 3.2. 
Let us test the null hypothesis that the effects of the two drugs are uncorre- 
lated against the alternative that they are positively correlated. We shall use 
the 5% level of significance. For N = 10, the 5% significance point ($r_0$) is
0.5494. Our observed correlation coefficient of 0.7952 is significant; we reject 
the hypothesis that the effects of the two drugs are independent. 
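
The significance point in this example can be reproduced, at least approximately, by solving (19) numerically. The following Python sketch is illustrative only: the names `null_density`, `upper_tail`, and `critical_point` are assumptions introduced here, and the tail integral is evaluated with Simpson's rule and inverted by bisection instead of being read from published tables. It assumes N > 4 and a significance level below 0.5.

```python
import math

def null_density(r, N):
    """Density (17) of r under zero population correlation (N > 4 assumed)."""
    const = math.gamma((N - 1) / 2) / (math.gamma((N - 2) / 2) * math.sqrt(math.pi))
    return const * (1.0 - r * r) ** ((N - 4) / 2)

def upper_tail(r0, N, m=2000):
    """Integral (19): probability that r exceeds r0 when H is true."""
    h = (1.0 - r0) / m
    total = null_density(r0, N) + null_density(1.0, N)
    for k in range(1, m):
        total += (4 if k % 2 else 2) * null_density(r0 + k * h, N)
    return total * h / 3

def critical_point(alpha, N):
    """One-sided significance point r0 with upper_tail(r0) = alpha, by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if upper_tail(mid, N) > alpha:
            lo = mid  # tail still too large: move the cut-off to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

N, r_obs = 10, 0.7952            # sample size and observed r quoted in the text
r0 = critical_point(0.05, N)     # should be close to the tabled 0.5494
t_stat = math.sqrt(N - 2) * r_obs / math.sqrt(1.0 - r_obs ** 2)  # statistic in (22)
reject = r_obs > r0
```

The same decision follows from comparing `t_stat` with the one-sided 5% point of the t-distribution with N - 2 = 8 degrees of freedom, as in (22).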


4.2.2. The Distribution When the Population Correlation Coefficient Is 
Nonzero; Tests of Hypotheses and Confidence Intervals 


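To find the distribution of the sample correlation coefficient when the
population coefficient is different from zero, we shall first derive the joint
density of $a_{11}$, $a_{12}$, and $a_{22}$. In Section 4.2.1 we saw that, conditional on $v_1$
held fixed, the random variables $b = a_{12}/a_{11}$ and $U/\sigma^2 = (a_{22} - a_{12}^2/a_{11})/\sigma^2$
are distributed independently according to $N(\beta, \sigma^2/c^2)$ and the $\chi^2$-distribution
with n - 1 degrees of freedom, respectively. Denoting the density of the
$\chi^2$-distribution by $g_{n-1}(u)$, we write the conditional density of b and U as
$n(b \mid \beta, \sigma^2/a_{11})\,g_{n-1}(u/\sigma^2)/\sigma^2$. The joint density of $V_1$, b, and U is
$n(v_1 \mid 0, \sigma_1^2 I)\,n(b \mid \beta, \sigma^2/a_{11})\,g_{n-1}(u/\sigma^2)/\sigma^2$. The marginal density of
$V_1'V_1/\sigma_1^2 = a_{11}/\sigma_1^2$ is $g_n(u)$; that is, the density of $a_{11}$ is

(24)  $$\frac{1}{\sigma_1^2}\,g_n\!\left(\frac{a_{11}}{\sigma_1^2}\right) = \int\!\cdots\!\int_{v_1'v_1 = a_{11}} n(v_1 \mid 0, \sigma_1^2 I)\,dW,$$

where dW is the proper volume element.
The integration is over the sphere $v_1'v_1 = a_{11}$; thus, dW is an element of
area on this sphere. (See Problem 7.1 for the use of angular coordinates in
defining dW.) Thus the joint density of b, U, and $a_{11}$ is

(25)  $$\int\!\cdots\!\int_{v_1'v_1 = a_{11}} n(b \mid \beta, \sigma^2/a_{11})\,\frac{g_{n-1}(u/\sigma^2)}{\sigma^2}\,n(v_1 \mid 0, \sigma_1^2 I)\,dW$$
$$= \frac{g_n(a_{11}/\sigma_1^2)\,n(b \mid \beta, \sigma^2/a_{11})\,g_{n-1}(u/\sigma^2)}{\sigma_1^2\sigma^2}$$
$$= \frac{1}{2^{\frac12 n}\,\Gamma(\tfrac12 n)\,\sigma_1^2}\left(\frac{a_{11}}{\sigma_1^2}\right)^{\frac12 n - 1}\exp\!\left(-\frac{a_{11}}{2\sigma_1^2}\right)\frac{\sqrt{a_{11}}}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{a_{11}(b-\beta)^2}{2\sigma^2}\right]$$
$$\times\;\frac{1}{2^{\frac12(n-1)}\,\Gamma[\tfrac12(n-1)]\,\sigma^2}\left(\frac{u}{\sigma^2}\right)^{\frac12(n-3)}\exp\!\left(-\frac{u}{2\sigma^2}\right).$$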


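Now let $b = a_{12}/a_{11}$, $U = a_{22} - a_{12}^2/a_{11}$. The Jacobian is

(26)  $$\left|\frac{\partial(b, u)}{\partial(a_{12}, a_{22})}\right| = \begin{vmatrix} \dfrac{1}{a_{11}} & 0 \\[2mm] -\dfrac{2a_{12}}{a_{11}} & 1 \end{vmatrix} = \frac{1}{a_{11}}.$$

Thus the density of $a_{11}$, $a_{12}$, and $a_{22}$ (where $a_{11} \geq 0$, $a_{22} \geq 0$, and
$a_{11}a_{22} - a_{12}^2 \geq 0$) is

(27)  $$\frac{(a_{11}a_{22} - a_{12}^2)^{\frac12(n-3)}\,e^{-\frac12 Q}}{2^n\sigma_1^n\sigma_2^n(1-\rho^2)^{\frac12 n}\sqrt{\pi}\,\Gamma(\tfrac12 n)\,\Gamma[\tfrac12(n-1)]},$$

where

(28)  $$Q = \frac{a_{11}}{\sigma_1^2} + \frac{a_{11}}{\sigma^2}\left(\frac{a_{12}}{a_{11}} - \beta\right)^2 + \frac{1}{\sigma^2}\left(a_{22} - \frac{a_{12}^2}{a_{11}}\right) = \frac{a_{11}}{\sigma_1^2(1-\rho^2)} - \frac{2\rho a_{12}}{\sigma_1\sigma_2(1-\rho^2)} + \frac{a_{22}}{\sigma_2^2(1-\rho^2)}.$$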


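The density can be written

(29)  $$\frac{|A|^{\frac12(n-3)}\,e^{-\frac12 \operatorname{tr}\Sigma^{-1}A}}{2^n|\Sigma|^{\frac12 n}\sqrt{\pi}\,\Gamma(\tfrac12 n)\,\Gamma[\tfrac12(n-1)]}$$

for A positive definite, and 0 otherwise. This is a special case of the Wishart
density derived in Chapter 7.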

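We want to find the density of

(30)  $$r = \frac{a_{12}}{\sqrt{a_{11}a_{22}}} = \frac{a_{12}/(\sigma_1\sigma_2)}{\sqrt{(a_{11}/\sigma_1^2)(a_{22}/\sigma_2^2)}} = \frac{a_{12}^*}{\sqrt{a_{11}^*a_{22}^*}},$$

where $a_{11}^* = a_{11}/\sigma_1^2$, $a_{22}^* = a_{22}/\sigma_2^2$, and $a_{12}^* = a_{12}/(\sigma_1\sigma_2)$. The transformation
is equivalent to setting $\sigma_1 = \sigma_2 = 1$. Then the density of $a_{11}$, $a_{22}$, and
$r = a_{12}/\sqrt{a_{11}a_{22}}$ $(da_{12} = dr\sqrt{a_{11}a_{22}})$ is

(31)  $$\frac{a_{11}^{\frac12 n - 1}\,a_{22}^{\frac12 n - 1}\,(1-r^2)^{\frac12(n-3)}\,e^{-\frac12 Q}}{2^n(1-\rho^2)^{\frac12 n}\sqrt{\pi}\,\Gamma(\tfrac12 n)\,\Gamma[\tfrac12(n-1)]},$$

where

(32)  $$Q = \frac{a_{11} - 2\rho r\sqrt{a_{11}}\sqrt{a_{22}} + a_{22}}{1-\rho^2}.$$

To find the density of r, we must integrate (31) with respect to $a_{11}$ and $a_{22}$
over the range 0 to $\infty$. There are various ways of carrying out the integration,
which result in different expressions for the density. The method we shall
indicate here is straightforward. We expand part of the exponential:

(33)  $$\exp\!\left[\frac{\rho r\sqrt{a_{11}}\sqrt{a_{22}}}{1-\rho^2}\right] = \sum_{\alpha=0}^{\infty}\frac{(\rho r\sqrt{a_{11}}\sqrt{a_{22}})^{\alpha}}{\alpha!\,(1-\rho^2)^{\alpha}}.$$

Then the density (31) is

(34)  $$\frac{(1-r^2)^{\frac12(n-3)}}{(1-\rho^2)^{\frac12 n}\,2^n\sqrt{\pi}\,\Gamma(\tfrac12 n)\,\Gamma[\tfrac12(n-1)]}\sum_{\alpha=0}^{\infty}\frac{(\rho r)^{\alpha}}{\alpha!\,(1-\rho^2)^{\alpha}}\left\{\exp\!\left[-\frac{a_{11}}{2(1-\rho^2)}\right]a_{11}^{\frac12(n+\alpha)-1}\right\}\left\{\exp\!\left[-\frac{a_{22}}{2(1-\rho^2)}\right]a_{22}^{\frac12(n+\alpha)-1}\right\}.$$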




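Since

(35)  $$\int_0^{\infty} a^{\frac12(n+\alpha)-1}\exp\!\left[-\frac{a}{2(1-\rho^2)}\right]da = \Gamma[\tfrac12(n+\alpha)]\,[2(1-\rho^2)]^{\frac12(n+\alpha)},$$

the integral of (34) (term-by-term integration is permissible) is

(36)  $$\frac{(1-r^2)^{\frac12(n-3)}}{(1-\rho^2)^{\frac12 n}\,2^n\sqrt{\pi}\,\Gamma(\tfrac12 n)\,\Gamma[\tfrac12(n-1)]}\sum_{\alpha=0}^{\infty}\frac{(\rho r)^{\alpha}}{\alpha!\,(1-\rho^2)^{\alpha}}\,\Gamma^2[\tfrac12(n+\alpha)]\,2^{n+\alpha}(1-\rho^2)^{n+\alpha}$$
$$= \frac{(1-\rho^2)^{\frac12 n}(1-r^2)^{\frac12(n-3)}}{\sqrt{\pi}\,\Gamma(\tfrac12 n)\,\Gamma[\tfrac12(n-1)]}\sum_{\alpha=0}^{\infty}\frac{(2\rho r)^{\alpha}}{\alpha!}\,\Gamma^2[\tfrac12(n+\alpha)].$$

The duplication formula for the gamma function is

(37)  $$\Gamma(2z) = \frac{2^{2z-1}\,\Gamma(z)\,\Gamma(z + \tfrac12)}{\sqrt{\pi}}.$$

It can be used to modify the constant in (36).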


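Theorem 4.2.2. The correlation coefficient in a sample of N from a bivariate
normal distribution with correlation $\rho$ is distributed with density

(38)  $$\frac{2^{n-2}(1-\rho^2)^{\frac12 n}(1-r^2)^{\frac12(n-3)}}{(n-2)!\,\pi}\sum_{\alpha=0}^{\infty}\frac{(2\rho r)^{\alpha}}{\alpha!}\,\Gamma^2[\tfrac12(n+\alpha)], \qquad -1 \leq r \leq 1,$$

where n = N - 1.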


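The distribution of r was first found by Fisher (1915). He also gave as
another form of the density,

(39)  $$\frac{(1-\rho^2)^{\frac12 n}(1-r^2)^{\frac12(n-3)}}{\pi\,(n-2)!}\left.\frac{d^{\,n-1}}{dx^{\,n-1}}\left\{\frac{\cos^{-1}(-x)}{\sqrt{1-x^2}}\right\}\right|_{x = r\rho}.$$

See Problem 4.24.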




Hotelling (1953) has made an exhaustive study of the distribution of r. He
has recommended the following form:

(40)  $$\frac{n-1}{\sqrt{2\pi}}\,\frac{\Gamma(n)}{\Gamma(n + \tfrac12)}\,(1-\rho^2)^{\frac12 n}(1-r^2)^{\frac12(n-3)}(1-\rho r)^{-n+\frac12}\,F\!\left(\tfrac12, \tfrac12; n + \tfrac12; \frac{1+\rho r}{2}\right),$$

where

(41)  $$F(a, b; c; x) = \sum_{j=0}^{\infty}\frac{\Gamma(a+j)}{\Gamma(a)}\,\frac{\Gamma(b+j)}{\Gamma(b)}\,\frac{\Gamma(c)}{\Gamma(c+j)}\,\frac{x^j}{j!}$$

is a hypergeometric function. (See Problem 4.25.) The series in (40) converges
more rapidly than the one in (38). Hotelling discusses methods of integrating
the density and also calculates moments of r.
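
The two expressions (38) and (40) describe the same density, which can be checked numerically. The Python sketch below is an illustration under stated assumptions: the function names are introduced here, the series in (38) is truncated at a fixed number of terms (adequate only when $|\rho r|$ is well below 1), and the hypergeometric series (41) is summed term by term until the terms are negligible.

```python
import math

def density_series(r, rho, N, terms=300):
    """Density (38); fixed truncation, intended for moderate |rho * r|."""
    n = N - 1
    log_const = ((n - 2) * math.log(2.0) + 0.5 * n * math.log(1.0 - rho * rho)
                 - math.lgamma(n - 1) - math.log(math.pi))  # (n-2)! = Gamma(n-1)
    x = 2.0 * rho * r
    total = 0.0
    for a in range(terms):
        if x == 0.0 and a > 0:
            break  # only the alpha = 0 term survives when rho * r = 0
        log_mag = 2.0 * math.lgamma((n + a) / 2) - math.lgamma(a + 1)
        if a > 0:
            log_mag += a * math.log(abs(x))
        sign = -1.0 if (x < 0 and a % 2) else 1.0
        total += sign * math.exp(log_mag)
    return math.exp(log_const) * (1.0 - r * r) ** ((n - 3) / 2) * total

def hyp2f1(a, b, c, x, tol=1e-15, max_terms=10000):
    """Hypergeometric series (41), summed until the terms are negligible."""
    term, total = 1.0, 1.0
    for j in range(max_terms):
        term *= (a + j) * (b + j) / ((c + j) * (j + 1)) * x
        total += term
        if abs(term) < tol * abs(total):
            break
    return total

def density_hotelling(r, rho, N):
    """Density (40), Hotelling's form."""
    n = N - 1
    log_const = (math.log(n - 1) - 0.5 * math.log(2.0 * math.pi)
                 + math.lgamma(n) - math.lgamma(n + 0.5))
    core = ((1.0 - rho * rho) ** (n / 2) * (1.0 - r * r) ** ((n - 3) / 2)
            * (1.0 - rho * r) ** (-n + 0.5))
    return math.exp(log_const) * core * hyp2f1(0.5, 0.5, n + 0.5, (1.0 + rho * r) / 2)

# Compare the two forms at a few illustrative points for N = 10.
pairs = [(0.3, 0.5), (-0.6, 0.5), (0.8, -0.4), (0.0, 0.0)]
values = [(density_series(r, rho, 10), density_hotelling(r, rho, 10)) for r, rho in pairs]
```

The hypergeometric argument $(1 + \rho r)/2$ never exceeds 1 and the coefficients in (41) decay quickly for $c = n + \tfrac12$, which is one way to see why (40) is the more convenient form for computation.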

The cumulative distribution of r,

(42)  $$\Pr\{r \leq r^*\} = F(r^* \mid N, \rho),$$

has been tabulated by David (1938) for* $\rho = 0(.1).9$, N = 3(1)25, 50, 100, 200,
400, and $r^* = -1(.05)1$. (David’s n is our N.) It is clear from the density (38)
that $F(r^* \mid N, \rho) = 1 - F(-r^* \mid N, -\rho)$ because the density for $r, \rho$ is equal to
the density for $-r, -\rho$. These tables can be used for a number of statistical
procedures.

First, we consider the problem of using a sample to test the hypothesis

(43)  $$H: \rho = \rho_0.$$

If the alternatives are $\rho > \rho_0$, we reject the hypothesis if the sample
correlation coefficient is greater than $r_0$, where $r_0$ is chosen so $1 - F(r_0 \mid N, \rho_0) = \alpha$,
the significance level. If the alternatives are $\rho < \rho_0$, we reject the hypothesis
if the sample correlation coefficient is less than $r_0'$, where $r_0'$ is chosen so
$F(r_0' \mid N, \rho_0) = \alpha$. If the alternatives are $\rho \neq \rho_0$, the region of rejection is $r > r_1$
and $r < r_1'$, where $r_1$ and $r_1'$ are chosen so $[1 - F(r_1 \mid N, \rho_0)] + F(r_1' \mid N, \rho_0) = \alpha$.
David suggests that $r_1$ and $r_1'$ be chosen so $[1 - F(r_1 \mid N, \rho_0)] = F(r_1' \mid N, \rho_0)
= \tfrac12\alpha$. She has shown (1937) that for $N \geq 10$, $|\rho| \leq 0.8$ this critical region is
nearly the region of an unbiased test of H, that is, a test whose power
function has its minimum at $\rho_0$.

It should be pointed out that any test based on r is invariant under
transformations of location and scale, that is, $x_{i\alpha}^* = b_i x_{i\alpha} + c_i$, $b_i > 0$, i = 1, 2,
$\alpha = 1, \ldots, N$; and r is essentially the only invariant of the sufficient statistics
(Problem 3.7). The above procedure for testing $H: \rho = \rho_0$ against
alternatives $\rho > \rho_0$ is uniformly most powerful among all invariant tests. (See
Problems 4.16, 4.17, and 4.18.)

*$\rho = 0(.1).9$ means $\rho = 0, 0.1, 0.2, \ldots, 0.9$.


Table 4.1. A Power Function

    ρ       Probability
  -1.0      0.0000
  -0.8      0.0000
  -0.6      0.0004
  -0.4      0.0032
  -0.2      0.0147
   0.0      0.0500
   0.2      0.1376
   0.4      0.3215
   0.6      0.6235
   0.8      0.9279
   1.0      1.0000

As an example suppose one wishes to test the hypothesis that $\rho = 0.5$
against alternatives $\rho \neq 0.5$ at the 5% level of significance using the
correlation observed in a sample of 15. In David’s tables we find (by interpolation)
that F(0.027 | 15, 0.5) = 0.025 and F(0.805 | 15, 0.5) = 0.975. Hence we reject
the hypothesis if our sample r is less than 0.027 or greater than 0.805.

Secondly, we can use David’s tables to compute the power function of
a test of correlation. If the region of rejection of H is $r > r_1$ and $r < r_1'$,
the power of the test is a function of the true correlation $\rho$, namely
$[1 - F(r_1 \mid N, \rho)] + F(r_1' \mid N, \rho)$; this is the probability of rejecting the null
hypothesis when the population correlation is $\rho$.

As an example consider finding the power function of the test for p=0 
considered in the preceding section. The rejection region (one-sided) is 
r> 0.5494 at the 5% significance level. The probabilities of rejection are 
given in Table 4.1. The graph of the power function is illustrated in Figure
4.2.
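
A power function of this kind can be approximated without tables by integrating the density of r over the rejection region. The Python sketch below is only an illustration: it uses Hotelling’s form (40) with n = N - 1, the function names are assumptions introduced here, and the integral is computed by Simpson’s rule. The degenerate cases $\rho = \pm 1$ are left out.

```python
import math

def hyp2f1(a, b, c, x, tol=1e-15, max_terms=10000):
    """Hypergeometric series (41), summed until the terms are negligible."""
    term, total = 1.0, 1.0
    for j in range(max_terms):
        term *= (a + j) * (b + j) / ((c + j) * (j + 1)) * x
        total += term
        if abs(term) < tol * abs(total):
            break
    return total

def density_r(r, rho, N):
    """Density of r in Hotelling's form (40), with n = N - 1."""
    n = N - 1
    log_const = (math.log(n - 1) - 0.5 * math.log(2.0 * math.pi)
                 + math.lgamma(n) - math.lgamma(n + 0.5))
    core = ((1.0 - rho * rho) ** (n / 2) * (1.0 - r * r) ** ((n - 3) / 2)
            * (1.0 - rho * r) ** (-n + 0.5))
    return math.exp(log_const) * core * hyp2f1(0.5, 0.5, n + 0.5, (1.0 + rho * r) / 2)

def power(rho, r0, N, m=2000):
    """Probability that r > r0 when the population correlation is rho."""
    h = (1.0 - r0) / m
    total = density_r(r0, rho, N) + density_r(1.0, rho, N)
    for k in range(1, m):
        total += (4 if k % 2 else 2) * density_r(r0 + k * h, rho, N)
    return total * h / 3

N, r0 = 10, 0.5494  # one-sided 5% rejection region r > 0.5494 from the text
rhos = [-0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8]
power_at = {rho: power(rho, r0, N) for rho in rhos}
```

The entries of `power_at` can be compared with the interior rows of Table 4.1; at $\rho = 0$ the value should be close to the nominal level 0.05.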

Figure 4.2. A power function.


Thirdly, David’s computations lead to confidence regions for $\rho$. For given
N, $r_1'$ (defining a significance point) is a function of $\rho$, say $f_1(\rho)$, and $r_1$ is
another function of $\rho$, say $f_2(\rho)$, such that

(44)  $$\Pr\{f_1(\rho) < r < f_2(\rho) \mid \rho\} = 1 - \alpha.$$

Clearly, $f_1(\rho)$ and $f_2(\rho)$ are monotonically increasing functions of $\rho$ if $r_1$
and $r_1'$ are chosen so $1 - F(r_1 \mid N, \rho) = \tfrac12\alpha = F(r_1' \mid N, \rho)$. If $\rho = f_i^{-1}(r)$ is the
inverse of $r = f_i(\rho)$, i = 1, 2, then the inequality $f_1(\rho) < r$ is equivalent to†
$\rho < f_1^{-1}(r)$, and $r < f_2(\rho)$ is equivalent to $f_2^{-1}(r) < \rho$. Thus (44) can be written

(45)  $$\Pr\{f_2^{-1}(r) < \rho < f_1^{-1}(r) \mid \rho\} = 1 - \alpha.$$

This equation says that the probability is $1 - \alpha$ that we draw a sample such
that the interval $(f_2^{-1}(r), f_1^{-1}(r))$ covers the parameter $\rho$. Thus this interval is
a confidence interval for $\rho$ with confidence coefficient $1 - \alpha$. For a given N
and $\alpha$ the curves $r = f_1(\rho)$ and $r = f_2(\rho)$ appear as in Figure 4.3. In testing
the hypothesis $\rho = \rho_0$, the intersection of the line $\rho = \rho_0$ and the two curves
gives the significance points $r_1$ and $r_1'$. In setting up a confidence region for $\rho$
on the basis of a sample correlation $r^*$, we find the limits $f_2^{-1}(r^*)$ and
$f_1^{-1}(r^*)$ by the intersection of the line $r = r^*$ with the two curves. David gives
these curves for $\alpha$ = 0.1, 0.05, 0.02, and 0.01 for various values of N.
One-sided confidence regions can be obtained by using only one inequality above.

†The point $(f_1(\rho), \rho)$ on the first curve is to the left of $(r, \rho)$, and the point $(r, f_1^{-1}(r))$ is above
$(r, \rho)$.


Figure 4.3

The tables of $F(r \mid N, \rho)$ can also be used instead of the curves for finding
the confidence interval. Given the sample value $r^*$, $f_1^{-1}(r^*)$ is the value of $\rho$
such that $\tfrac12\alpha = \Pr\{r \leq r^* \mid \rho\} = F(r^* \mid N, \rho)$, and similarly $f_2^{-1}(r^*)$ is the value
of $\rho$ such that $\tfrac12\alpha = \Pr\{r \geq r^* \mid \rho\} = 1 - F(r^* \mid N, \rho)$. The interval between
these two values of $\rho$, $(f_2^{-1}(r^*), f_1^{-1}(r^*))$, is the confidence interval.

As an example, consider the confidence interval with confidence
coefficient 0.95 based on the correlation of 0.7952 observed in a sample of 10.
Using Graph II of David, we find the two limits are 0.34 and 0.94. Hence we
state that $0.34 < \rho < 0.94$ with confidence 95%.
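
The inversion of $F(r^* \mid N, \rho)$ described above can also be carried out numerically. The following Python sketch is an illustration under stated assumptions: it evaluates the density in Hotelling’s form (40), integrates it by Simpson’s rule, and solves for $\rho$ by bisection on the interval (-0.999, 0.999). The function names are introduced here, and the numerical limits need not coincide exactly with values read from David’s graph.

```python
import math

def hyp2f1(a, b, c, x, tol=1e-15, max_terms=10000):
    """Hypergeometric series (41), summed until the terms are negligible."""
    term, total = 1.0, 1.0
    for j in range(max_terms):
        term *= (a + j) * (b + j) / ((c + j) * (j + 1)) * x
        total += term
        if abs(term) < tol * abs(total):
            break
    return total

def density_r(r, rho, N):
    """Density of r in Hotelling's form (40), with n = N - 1."""
    n = N - 1
    log_const = (math.log(n - 1) - 0.5 * math.log(2.0 * math.pi)
                 + math.lgamma(n) - math.lgamma(n + 0.5))
    core = ((1.0 - rho * rho) ** (n / 2) * (1.0 - r * r) ** ((n - 3) / 2)
            * (1.0 - rho * r) ** (-n + 0.5))
    return math.exp(log_const) * core * hyp2f1(0.5, 0.5, n + 0.5, (1.0 + rho * r) / 2)

def cdf_r(r_star, rho, N, m=400):
    """F(r* | N, rho) of (42), by Simpson's rule on [-1, r*]."""
    h = (r_star + 1.0) / m
    total = density_r(-1.0, rho, N) + density_r(r_star, rho, N)
    for k in range(1, m):
        total += (4 if k % 2 else 2) * density_r(-1.0 + k * h, rho, N)
    return total * h / 3

def solve_rho(r_star, N, target, iterations=30):
    """Find rho with F(r* | N, rho) = target; F decreases as rho increases."""
    lo, hi = -0.999, 0.999
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if cdf_r(r_star, mid, N) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

r_star, N, alpha = 0.7952, 10, 0.05
upper = solve_rho(r_star, N, alpha / 2)        # F(r* | N, rho) = alpha/2
lower = solve_rho(r_star, N, 1.0 - alpha / 2)  # 1 - F(r* | N, rho) = alpha/2
```

The pair (`lower`, `upper`) plays the role of $(f_2^{-1}(r^*), f_1^{-1}(r^*))$ and should fall near the limits 0.34 and 0.94 quoted in the example.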


Definition 4.2.1. Let $L(x, \theta)$ be the likelihood function of the observation
vector x and the parameter vector $\theta \in \Omega$. Let a null hypothesis be defined by a
proper subset $\omega$ of $\Omega$. The likelihood ratio criterion is

(46)  $$\lambda(x) = \frac{\sup_{\theta \in \omega} L(x, \theta)}{\sup_{\theta \in \Omega} L(x, \theta)}.$$

The likelihood ratio test is the procedure of rejecting the null hypothesis when
$\lambda(x)$ is less than a predetermined constant.


Intuitively, one rejects the null hypothesis if the density of the observa- 
tions under the most favorable choice of parameters in the null hypothesis is 
much less than the density under the most favorable unrestricted choice of 
the parameters. Likelihood ratio tests have some desirable features; see
Lehmann (1959), for example. Wald (1943) has proved some favorable
asymptotic properties. For most hypotheses concerning the multivariate
normal distribution, likelihood ratio tests are appropriate and often are 
optimal. 

Let us consider the likelihood ratio test of the hypothesis that $\rho = \rho_0$
based on a sample $x_1, \ldots, x_N$ from the bivariate normal distribution. The set
$\Omega$ consists of $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, and $\rho$ such that $\sigma_1 > 0$, $\sigma_2 > 0$, $-1 < \rho < 1$.
The set $\omega$ is the subset for which $\rho = \rho_0$. The likelihood maximized in $\Omega$ is
(by Lemmas 3.2.2 and 3.2.3)

(47)  $$\max_{\Omega} L = \frac{N^N e^{-N}}{(2\pi)^N (1-r^2)^{\frac12 N}\,a_{11}^{\frac12 N}\,a_{22}^{\frac12 N}}.$$




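Under the null hypothesis the likelihood function is

(48)  $$\frac{1}{(2\pi)^N (1-\rho_0^2)^{\frac12 N} (\sigma^2)^N}\exp\!\left[-\frac{a_{11}/\tau + \tau a_{22} - 2\rho_0 a_{12}}{2\sigma^2(1-\rho_0^2)}\right],$$

where $\sigma^2 = \sigma_1\sigma_2$ and $\tau = \sigma_1/\sigma_2$. The maximum of (48) with respect to $\tau$
occurs at $\hat{\tau} = \sqrt{a_{11}}/\sqrt{a_{22}}$. The concentrated likelihood is

(49)  $$\frac{1}{(2\pi)^N (1-\rho_0^2)^{\frac12 N} (\sigma^2)^N}\exp\!\left[-\frac{\sqrt{a_{11}a_{22}}\,(1-\rho_0 r)}{\sigma^2(1-\rho_0^2)}\right];$$

the maximum of (49) occurs at

(50)  $$\hat{\sigma}^2 = \frac{\sqrt{a_{11}a_{22}}\,(1-\rho_0 r)}{N(1-\rho_0^2)}.$$

The likelihood ratio criterion is, therefore,

(51)  $$\frac{\max_{\omega} L}{\max_{\Omega} L} = \frac{(1-\rho_0^2)^{\frac12 N}(1-r^2)^{\frac12 N}}{(1-\rho_0 r)^N} = \left[\frac{(1-\rho_0^2)(1-r^2)}{(1-\rho_0 r)^2}\right]^{\frac12 N}.$$

The likelihood ratio test is $(1-\rho_0^2)(1-r^2)(1-\rho_0 r)^{-2} \leq c$, where c is chosen
so the probability of the inequality when samples are drawn from normal
populations with correlation $\rho_0$ is the prescribed significance level. The
critical region can be written equivalently as

(52)  $$(\rho_0^2 c - \rho_0^2 + 1)\,r^2 - 2\rho_0 c\,r + c - 1 + \rho_0^2 \geq 0,$$

or

(53)  $$r > \frac{\rho_0 c + (1-\rho_0^2)\sqrt{1-c}}{\rho_0^2 c + 1 - \rho_0^2}, \qquad r < \frac{\rho_0 c - (1-\rho_0^2)\sqrt{1-c}}{\rho_0^2 c + 1 - \rho_0^2}.$$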


Thus the likelihood ratio test of $H: \rho = \rho_0$ against alternatives $\rho \neq \rho_0$ has a
rejection region of the form $r > r_1$ and $r < r_1'$; but $r_1$ and $r_1'$ are not chosen so
that the probability of each inequality is $\alpha/2$ when H is true, but are taken
to be of the form given in (53), where c is chosen so that the probability of
the two inequalities is $\alpha$.
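
The algebra leading from the criterion to (53) can be checked with a few lines of Python. This is a sketch for illustration only: the function names are assumptions introduced here, and the values of c and $\rho_0$ are arbitrary, not calibrated to any significance level.

```python
import math

def lr_statistic(r, rho0):
    """(1 - rho0^2)(1 - r^2) / (1 - rho0 r)^2, the (2/N)th power of criterion (51)."""
    return (1.0 - rho0 ** 2) * (1.0 - r ** 2) / (1.0 - rho0 * r) ** 2

def lr_critical_points(c, rho0):
    """End points of the rejection region given by (53); requires 0 < c <= 1."""
    denom = rho0 ** 2 * c + 1.0 - rho0 ** 2
    half_width = (1.0 - rho0 ** 2) * math.sqrt(1.0 - c)
    return (rho0 * c - half_width) / denom, (rho0 * c + half_width) / denom

rho0, c = 0.5, 0.7                      # arbitrary illustrative values
r_low, r_high = lr_critical_points(c, rho0)

# At both end points the statistic equals c, so they are the roots of (52).
stat_low = lr_statistic(r_low, rho0)
stat_high = lr_statistic(r_high, rho0)
```

The two end points are not symmetric about $\rho_0$ in general, which reflects the remark that the likelihood ratio region differs from the equal-tails region.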




4.2.3. The Asymptotic Distribution of a Sample Correlation Coefficient 
and Fisher’s z 
In this section we shall show that as the sample size increases, a sample 
correlation coefficient tends to be normally distributed. The distribution of a 
particular function of a sample correlation, Fisher’s z [Fisher (1921)], which
has a variance approximately independent of the population correlation,
tends to normality faster.

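We are particularly interested in the sample correlation coefficient

(54)  $$r(n) = \frac{A_{ij}(n)}{\sqrt{A_{ii}(n)\,A_{jj}(n)}}$$

for some i and j, $i \neq j$. This can also be written

(55)  $$r(n) = \frac{C_{ij}(n)}{\sqrt{C_{ii}(n)\,C_{jj}(n)}},$$

where $C_{gh}(n) = A_{gh}(n)/\sqrt{\sigma_{gg}\sigma_{hh}}$. The set $C_{ii}(n)$, $C_{jj}(n)$, and $C_{ij}(n)$ is
distributed like the distinct elements of the matrix

$$\sum_{\alpha=1}^{n}\begin{pmatrix} Z_{i\alpha}^* \\ Z_{j\alpha}^* \end{pmatrix}(Z_{i\alpha}^*, Z_{j\alpha}^*) = \sum_{\alpha=1}^{n}\begin{pmatrix} Z_{i\alpha}^{*2} & Z_{i\alpha}^* Z_{j\alpha}^* \\ Z_{j\alpha}^* Z_{i\alpha}^* & Z_{j\alpha}^{*2} \end{pmatrix},$$

where the $(Z_{i\alpha}^*, Z_{j\alpha}^*)$ are independent, each with distribution

(56)  $$N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right],$$

where

$$\rho = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}.$$

Let

(57)  $$U(n) = \frac{1}{n}\begin{pmatrix} C_{ii}(n) \\ C_{jj}(n) \\ C_{ij}(n) \end{pmatrix},$$

(58)  $$b = \begin{pmatrix} 1 \\ 1 \\ \rho \end{pmatrix}.$$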



Then by Theorem 3.4.4 the vector $\sqrt{n}\,[U(n) - b]$ has a limiting normal
distribution with mean 0 and covariance matrix

(59)  $$\begin{pmatrix} 2 & 2\rho^2 & 2\rho \\ 2\rho^2 & 2 & 2\rho \\ 2\rho & 2\rho & 1+\rho^2 \end{pmatrix}.$$

Now we need the general theorem: 


Theorem 4.2.3. Let {U(n)} be a sequence of m-component random vectors
and b a fixed vector such that $\sqrt{n}\,[U(n) - b]$ has the limiting distribution N(0, T)
as $n \to \infty$. Let f(u) be a vector-valued function of u such that each component
$f_j(u)$ has a nonzero differential at u = b, and let $\partial f_j(u)/\partial u_i|_{u=b}$ be the i, jth
component of $\Phi_b$. Then $\sqrt{n}\,\{f[U(n)] - f(b)\}$ has the limiting distribution
$N(0, \Phi_b' T \Phi_b)$.


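Proof. See Serfling (1980), Section 3.3, or Rao (1973), Section 6a.2. A
function g(u) is said to have a differential at b or to be totally differentiable
at b if the partial derivatives $\partial g(u)/\partial u_i$ exist at u = b and for every $\varepsilon > 0$
there exists a neighborhood $N_\varepsilon(b)$ such that

(60)  $$\left|\,g(u) - g(b) - \sum_{i=1}^{m}\frac{\partial g}{\partial u_i}(b)\,(u_i - b_i)\right| \leq \varepsilon\,\|u - b\| \quad \text{for all } u \in N_\varepsilon(b). \qquad \blacksquare$$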


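It is clear that U(n) defined by (57) with b and T defined by (58) and (59),
respectively, satisfies the conditions of the theorem. The function

(61)  $$r = \frac{u_3}{\sqrt{u_1 u_2}} = u_3\,u_1^{-\frac12}u_2^{-\frac12}$$

satisfies the conditions; the elements of $\Phi_b$ are

(62)  $$\left.\frac{\partial r}{\partial u_1}\right|_{u=b} = \left.-\tfrac12 u_3\,u_1^{-\frac32}u_2^{-\frac12}\right|_{u=b} = -\tfrac12\rho,$$
$$\left.\frac{\partial r}{\partial u_2}\right|_{u=b} = \left.-\tfrac12 u_3\,u_1^{-\frac12}u_2^{-\frac32}\right|_{u=b} = -\tfrac12\rho,$$
$$\left.\frac{\partial r}{\partial u_3}\right|_{u=b} = \left.u_1^{-\frac12}u_2^{-\frac12}\right|_{u=b} = 1,$$

and $f(b) = \rho$. The variance of the limiting distribution of $\sqrt{n}\,[r(n) - \rho]$ is

(63)  $$(-\tfrac12\rho, -\tfrac12\rho, 1)\begin{pmatrix} 2 & 2\rho^2 & 2\rho \\ 2\rho^2 & 2 & 2\rho \\ 2\rho & 2\rho & 1+\rho^2 \end{pmatrix}\begin{pmatrix} -\tfrac12\rho \\ -\tfrac12\rho \\ 1 \end{pmatrix} = (\rho - \rho^3, \rho - \rho^3, 1 - \rho^2)\begin{pmatrix} -\tfrac12\rho \\ -\tfrac12\rho \\ 1 \end{pmatrix}$$
$$= 1 - 2\rho^2 + \rho^4 = (1 - \rho^2)^2.$$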
Thus we obtain the following: 


Theorem 4.2.4. If r(n) is the sample correlation coefficient of a sample of N
(= n + 1) from a normal distribution with correlation $\rho$, then $\sqrt{n}\,[r(n) - \rho]/(1 - \rho^2)$
[or $\sqrt{N}\,[r(n) - \rho]/(1 - \rho^2)$] has the limiting distribution N(0, 1).
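
The quadratic form behind this result is easy to verify numerically. The sketch below is an illustration only (the function name `asymptotic_variance_r` is an assumption introduced here): it multiplies the gradient of $r = u_3 u_1^{-1/2} u_2^{-1/2}$ at b, namely $(-\tfrac12\rho, -\tfrac12\rho, 1)$, by the limiting covariance matrix of $\sqrt{n}\,[U(n) - b]$ and compares the result with $(1 - \rho^2)^2$.

```python
def asymptotic_variance_r(rho):
    """Quadratic form phi' T phi for the limiting variance of sqrt(n) [r(n) - rho]."""
    T = [
        [2.0, 2.0 * rho ** 2, 2.0 * rho],
        [2.0 * rho ** 2, 2.0, 2.0 * rho],
        [2.0 * rho, 2.0 * rho, 1.0 + rho ** 2],
    ]
    phi = [-0.5 * rho, -0.5 * rho, 1.0]  # partial derivatives of r evaluated at b
    T_phi = [sum(T[i][j] * phi[j] for j in range(3)) for i in range(3)]
    return sum(phi[i] * T_phi[i] for i in range(3))

# Compare with the closed form (1 - rho^2)^2 on a grid of correlations.
checks = [(rho, asymptotic_variance_r(rho), (1.0 - rho ** 2) ** 2)
          for rho in (-0.9, -0.5, 0.0, 0.3, 0.8)]
```

The variance shrinks to zero as $|\rho|$ approaches 1, which is the dependence on $\rho$ that Fisher’s z is designed to remove.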


It is clear from Theorem 4.2.3 that if f(x) is differentiable at $x = \rho$, then
$\sqrt{n}\,[f(r) - f(\rho)]$ is asymptotically normally distributed with mean zero and
variance

$$\left(\left.\frac{df}{dx}\right|_{x=\rho}\right)^2 (1 - \rho^2)^2.$$

A useful function to consider is one whose asymptotic variance is constant
(here unity) independent of the parameter $\rho$. This function satisfies the
equation

(64)  $$f'(\rho) = \frac{1}{1-\rho^2} = \frac12\left(\frac{1}{1+\rho} + \frac{1}{1-\rho}\right).$$

Thus $f(\rho)$ is taken as $\tfrac12[\log(1+\rho) - \log(1-\rho)] = \tfrac12\log[(1+\rho)/(1-\rho)]$. The
so-called Fisher’s z is

(65)  $$z = \frac12\log\frac{1+r}{1-r} = \tanh^{-1} r,$$

where $r = \tanh z = (e^z - e^{-z})/(e^z + e^{-z})$. Let

(66)  $$\zeta = \frac12\log\frac{1+\rho}{1-\rho}.$$




Theorem 4.2.5. Let z be defined by (65), where r is the correlation
coefficient of a sample of N (= n + 1) from a bivariate normal distribution with
correlation $\rho$; let $\zeta$ be defined by (66). Then $\sqrt{n}\,(z - \zeta)$ has a limiting normal
distribution with mean 0 and variance 1.


It can be shown that to a closer approximation

(67)  $$\mathcal{E}z \cong \zeta + \frac{\rho}{2n},$$

(68)  $$\mathcal{E}(z - \zeta)^2 \cong \frac{1}{n-2} \cong \mathcal{E}\left(z - \zeta - \frac{\rho}{2n}\right)^2.$$

The latter follows from

(69)  $$\mathcal{E}(z - \zeta)^2 = \frac{1}{n} + \frac{8 - \rho^2}{4n^2} + \cdots$$

and holds good for $\rho^2/n^2$ small. Hotelling (1953) gives moments of z to order
$n^{-3}$. An important property of Fisher’s z is that the approach to normality is
much more rapid than for r. David (1938) makes some comparisons between
the tabulated probabilities and the probabilities computed by assuming z is
normally distributed. She recommends that for N > 25 one take z as
normally distributed with mean and variance given by (67) and (68). Konishi
(1978a, 1978b, 1979) has also studied z. [Ruben (1966) has suggested an
alternative approach, which is more complicated, but possibly more accurate.]
We shall now indicate how Theorem 4.2.5 can be used.


a. Suppose we wish to test the hypothesis $\rho = \rho_0$ on the basis of a sample
of N against the alternatives $\rho \neq \rho_0$. We compute r and then z by (65). Let

(70)  $$\zeta_0 = \frac12\log\frac{1+\rho_0}{1-\rho_0}.$$

Then a region of rejection at the 5% significance level is

(71)  $$\sqrt{N-3}\,|z - \zeta_0| > 1.96.$$

A better region is

(72)  $$\sqrt{N-3}\,\left|z - \zeta_0 - \frac{\tfrac12\rho_0}{N-1}\right| > 1.96.$$

b. Suppose we have a sample of $N_1$ from one population and a sample of
$N_2$ from a second population. How do we test the hypothesis that the two
correlation coefficients are equal, $\rho_1 = \rho_2$? From Theorem 4.2.5 we know
that if the null hypothesis is true then $z_1 - z_2$ [where $z_1$ and $z_2$ are defined
by (65) for the two sample correlation coefficients] is asymptotically normally
distributed with mean 0 and variance $1/(N_1 - 3) + 1/(N_2 - 3)$. As a critical
region of size 5%, we use

(73)  $$\frac{|z_1 - z_2|}{\sqrt{1/(N_1 - 3) + 1/(N_2 - 3)}} > 1.96.$$


c. Under the conditions of paragraph b, assume that $\rho_1 = \rho_2 = \rho$. How
do we use the results of both samples to give a joint estimate of $\rho$? Since $z_1$
and $z_2$ have variances $1/(N_1 - 3)$ and $1/(N_2 - 3)$, respectively, we can
estimate $\zeta$ by

(74)  $$\frac{(N_1 - 3)z_1 + (N_2 - 3)z_2}{N_1 + N_2 - 6}$$

and convert this to an estimate of $\rho$ by the inverse of (65).


d. Let r be the sample correlation from N observations. How do we
obtain a confidence interval for $\rho$? We know that approximately

(75)  $$\Pr\{-1.96 \leq \sqrt{N-3}\,(z - \zeta) \leq 1.96\} = 0.95.$$

From this we deduce that $[-1.96/\sqrt{N-3} + z,\; 1.96/\sqrt{N-3} + z]$ is a
confidence interval for $\zeta$. From this we obtain an interval for $\rho$ using the fact
$\rho = \tanh\zeta = (e^{\zeta} - e^{-\zeta})/(e^{\zeta} + e^{-\zeta})$, which is a monotonic transformation.
Thus a 95% confidence interval is

(76)  $$\tanh(z - 1.96/\sqrt{N-3}) \leq \rho \leq \tanh(z + 1.96/\sqrt{N-3}).$$
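
Procedures a through d translate directly into code. The Python sketch below is a hedged illustration: the function names are assumptions introduced here, the normal significance point is taken from the standard library’s `statistics.NormalDist` rather than fixed at 1.96, and the refinement (72) is not included.

```python
import math
from statistics import NormalDist

def fisher_z(r):
    """Fisher's z of (65): z = (1/2) log[(1 + r)/(1 - r)] = atanh(r)."""
    return math.atanh(r)

def z_test(r, N, rho0, alpha=0.05):
    """Two-sided test (71) of rho = rho0; returns (statistic, reject)."""
    crit = NormalDist().inv_cdf(1.0 - alpha / 2)
    stat = math.sqrt(N - 3) * abs(fisher_z(r) - fisher_z(rho0))
    return stat, stat > crit

def two_sample_test(r1, N1, r2, N2, alpha=0.05):
    """Test (73) of rho_1 = rho_2 from two independent samples."""
    crit = NormalDist().inv_cdf(1.0 - alpha / 2)
    stat = abs(fisher_z(r1) - fisher_z(r2)) / math.sqrt(1.0 / (N1 - 3) + 1.0 / (N2 - 3))
    return stat, stat > crit

def pooled_estimate(r1, N1, r2, N2):
    """Weighted average (74) of the two z values, mapped back to the r scale."""
    zeta_hat = ((N1 - 3) * fisher_z(r1) + (N2 - 3) * fisher_z(r2)) / (N1 + N2 - 6)
    return math.tanh(zeta_hat)

def z_interval(r, N, level=0.95):
    """Approximate confidence interval (76) for rho."""
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = crit / math.sqrt(N - 3)
    z = fisher_z(r)
    return math.tanh(z - half_width), math.tanh(z + half_width)

# r = 0.7952 from a sample of N = 10, the example used earlier in this section.
lower, upper = z_interval(0.7952, 10)
stat, reject = z_test(0.7952, 10, 0.0)
```

For this example the approximate interval is roughly (0.33, 0.95), which may be compared with the limits 0.34 and 0.94 obtained from David’s graph.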


The bootstrap method has been developed to assess the variability of a 
sample quantity. See Efron (1982). We shall illustrate the method on the 
sample correlation coefficient, but it can be applied to other quantities 
studied in this book. 

Suppose $x_1, \ldots, x_N$ is a sample from some bivariate population not
necessarily normal. The approach of the bootstrap is to consider these N vectors
as a finite population of size N; a random vector X has the (discrete)
probability

(77)  $$\Pr\{X = x_\alpha\} = \frac{1}{N}, \qquad \alpha = 1, \ldots, N.$$




A random sample of size N drawn from this finite population has a probability
distribution, and the correlation coefficient calculated from such a sample
has a (discrete) probability distribution, say $p_N(r)$. The bootstrap proposes to
use this distribution in place of the unobtainable distribution of the correlation
coefficient of random samples from the parent population. However, it is
prohibitively expensive to compute; instead $p_N(r)$ is estimated by the empirical
distribution of r calculated from a large number of random samples from
(77). Diaconis and Efron (1983) have given an example of N = 15; they find
the empirical distribution closely resembles the actual distribution of r
(essentially obtainable in this special case). An advantage of this approach is
that it is not necessary to assume knowledge of the parent population;
a disadvantage is the massive computation.
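
The resampling scheme can be sketched in a few lines of Python. Everything specific in this example is an assumption made for illustration: the function names, the number of resamples, the fixed seed, the percentile summary, and above all the data, which are made up and are not the N = 15 example of Diaconis and Efron. Resamples in which one variable has zero variance are skipped, since the correlation is undefined for them.

```python
import math
import random

def sample_correlation(pairs):
    """Sample correlation coefficient of a list of (x, y) pairs."""
    n_obs = len(pairs)
    x_bar = sum(p[0] for p in pairs) / n_obs
    y_bar = sum(p[1] for p in pairs) / n_obs
    a_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    a_xx = sum((x - x_bar) ** 2 for x, _ in pairs)
    a_yy = sum((y - y_bar) ** 2 for _, y in pairs)
    return a_xy / (math.sqrt(a_xx) * math.sqrt(a_yy))

def bootstrap_correlations(pairs, n_resamples=2000, seed=0):
    """Empirical estimate of p_N(r): r computed from resamples drawn as in (77)."""
    rng = random.Random(seed)
    N = len(pairs)
    out = []
    while len(out) < n_resamples:
        # Each draw picks one of the N observed vectors with probability 1/N.
        resample = [pairs[rng.randrange(N)] for _ in range(N)]
        try:
            out.append(sample_correlation(resample))
        except ZeroDivisionError:
            continue  # degenerate resample: correlation undefined
    return out

# Made-up bivariate data, N = 15 (illustration only).
pairs = [(1.2, 2.1), (2.3, 2.9), (3.1, 3.7), (4.0, 3.5), (5.2, 5.1),
         (6.1, 5.8), (7.3, 6.2), (8.0, 8.1), (9.1, 8.7), (10.2, 9.4),
         (11.0, 11.3), (12.4, 11.9), (13.1, 12.2), (14.3, 14.8), (15.0, 14.1)]
boot = bootstrap_correlations(pairs, n_resamples=2000, seed=0)
ordered = sorted(boot)
lower, upper = ordered[int(0.025 * len(ordered))], ordered[int(0.975 * len(ordered)) - 1]
```

The spread of `boot`, summarized here by two empirical percentiles, is what the bootstrap offers as an assessment of the variability of r without assuming normality.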


4.3. PARTIAL CORRELATION COEFFICIENTS; 
CONDITIONAL DISTRIBUTIONS 


4.3.1. Estimation of Partial Correlation Coefficients 


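Partial correlation coefficients in normal distributions are correlation
coefficients in conditional distributions. It was shown in Section 2.5 that if X is
distributed according to $N(\mu, \Sigma)$, where

(1)  $$X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$

then the conditional distribution of $X^{(1)}$ given $X^{(2)} = x^{(2)}$ is
$N[\mu^{(1)} + B(x^{(2)} - \mu^{(2)}), \Sigma_{11\cdot 2}]$, where

(2)  $$B = \Sigma_{12}\Sigma_{22}^{-1},$$

(3)  $$\Sigma_{11\cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$

The partial correlations of $X^{(1)}$ given $x^{(2)}$ are the correlations calculated in
the usual way from $\Sigma_{11\cdot 2}$. In this section we are interested in statistical
problems concerning these correlation coefficients.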

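First we consider the problem of estimation on the basis of a sample of N
from $N(\mu, \Sigma)$. What are the maximum likelihood estimators of the partial
correlations of $X^{(1)}$ (of q components), $\rho_{ij\cdot q+1,\ldots,p}$? We know that the
maximum likelihood estimator of $\Sigma$ is (1/N)A, where

(4)  $$A = \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' = \sum_{\alpha=1}^{N}\begin{pmatrix} x_\alpha^{(1)} - \bar{x}^{(1)} \\ x_\alpha^{(2)} - \bar{x}^{(2)} \end{pmatrix}\left((x_\alpha^{(1)} - \bar{x}^{(1)})',\; (x_\alpha^{(2)} - \bar{x}^{(2)})'\right) = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$$

and $\bar{x} = (1/N)\sum_{\alpha=1}^{N} x_\alpha = (\bar{x}^{(1)\prime}\; \bar{x}^{(2)\prime})'$. The correspondence between $\Sigma$ and
$\Sigma_{11\cdot 2}$, B, and $\Sigma_{22}$ is one-to-one by virtue of (2) and (3) and

(5)  $$\Sigma_{12} = B\Sigma_{22},$$

(6)  $$\Sigma_{11} = \Sigma_{11\cdot 2} + B\Sigma_{22}B'.$$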

We can now apply Corollary 3.2.1 to the effect that maximum likelihood 
estimators of functions of parameters are those functions of the maximum 
likelihood estimators of those parameters. 


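Theorem 4.3.1. Let $x_1, \ldots, x_N$ be a sample from $N(\mu, \Sigma)$, where $\mu$
and $\Sigma$ are partitioned as in (1). Define A by (4) and $(\bar{x}^{(1)\prime}\; \bar{x}^{(2)\prime}) =
(1/N)\sum_{\alpha=1}^{N} (x_\alpha^{(1)\prime}\; x_\alpha^{(2)\prime})$. Then the maximum likelihood estimators of $\mu^{(1)}$, $\mu^{(2)}$,
B, $\Sigma_{11\cdot 2}$, and $\Sigma_{22}$ are $\hat{\mu}^{(1)} = \bar{x}^{(1)}$, $\hat{\mu}^{(2)} = \bar{x}^{(2)}$,

(7)  $$\hat{B} = A_{12}A_{22}^{-1}, \qquad \hat{\Sigma}_{11\cdot 2} = \frac{1}{N}\left(A_{11} - A_{12}A_{22}^{-1}A_{21}\right),$$

and $\hat{\Sigma}_{22} = (1/N)A_{22}$, respectively.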


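In turn, Corollary 3.2.1 can be used to obtain the maximum likelihood
estimators of $\mu^{(1)}$, $\mu^{(2)}$, B, $\Sigma_{22}$, $\sigma_{ii\cdot q+1,\ldots,p}$, $i = 1, \ldots, q$, and $\rho_{ij\cdot q+1,\ldots,p}$,
$i, j = 1, \ldots, q$. It follows that the maximum likelihood estimators of the partial
correlation coefficients are

(8)  $$\hat{\rho}_{ij\cdot q+1,\ldots,p} = \frac{\hat{\sigma}_{ij\cdot q+1,\ldots,p}}{\sqrt{\hat{\sigma}_{ii\cdot q+1,\ldots,p}\,\hat{\sigma}_{jj\cdot q+1,\ldots,p}}}, \qquad i, j = 1, \ldots, q,$$

where $\hat{\sigma}_{ij\cdot q+1,\ldots,p}$ is the i, jth element of $\hat{\Sigma}_{11\cdot 2}$.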




Theorem 4.3.2. Let $x_1, \ldots, x_N$ be a sample of N from $N(\mu, \Sigma)$. The
maximum likelihood estimators of $\rho_{ij\cdot q+1,\ldots,p}$, the partial correlations of the first
q components conditional on the last p - q components, are given by

(9)  $$\hat{\rho}_{ij\cdot q+1,\ldots,p} = \frac{a_{ij\cdot q+1,\ldots,p}}{\sqrt{a_{ii\cdot q+1,\ldots,p}\,a_{jj\cdot q+1,\ldots,p}}}, \qquad i, j = 1, \ldots, q,$$

where

(10)  $$(a_{ij\cdot q+1,\ldots,p}) = A_{11} - A_{12}A_{22}^{-1}A_{21} = A_{11\cdot 2}.$$

The estimator f,,.941,....p» denoted by r,.941.....p» iS called the sample 
partial correlation coefficient between X, and X, holding X,,\,-.., X,, fixed. It is 
also called the sample partial correlation coefficient between X; and X;, 
having taken account of X,,,,...,X,. Note that the calculations can be done 
in terms of (r,,). 

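The matrix $A_{11\cdot 2}$ can also be represented as


(11) $A_{11\cdot 2} = \sum_{\alpha=1}^{N} \left[ x_\alpha^{(1)} - \bar{x}^{(1)} - \hat{B}(x_\alpha^{(2)} - \bar{x}^{(2)}) \right] \left[ x_\alpha^{(1)} - \bar{x}^{(1)} - \hat{B}(x_\alpha^{(2)} - \bar{x}^{(2)}) \right]' = A_{11} - \hat{B}A_{22}\hat{B}'.$


The vector $x_\alpha^{(1)} - \bar{x}^{(1)} - \hat{B}(x_\alpha^{(2)} - \bar{x}^{(2)})$ is the residual of $x_\alpha^{(1)}$ from its regression
on $x_\alpha^{(2)}$ and 1. The partial correlations are simple correlations between these
residuals. The definition can be used also when the distributions involved are
not normal.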
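
The equivalence between (10) and the residual representation (11) can be checked numerically. The following is a minimal sketch in Python with NumPy, not part of the original text; the function name `partial_correlations`, the simulated data, and the seed are assumptions made purely for illustration.

```python
import numpy as np

def partial_correlations(x, q):
    """Sample partial correlations of the first q columns of x given the rest."""
    d = x - x.mean(axis=0)              # deviations from the sample mean
    a = d.T @ d                         # A as in (4)
    a11, a12, a22 = a[:q, :q], a[:q, q:], a[q:, q:]
    b_hat = a12 @ np.linalg.inv(a22)    # B-hat = A12 A22^{-1}, as in (7)
    a11_2 = a11 - b_hat @ a12.T         # A11.2 as in (10); A21 = A12'
    s = np.sqrt(np.diag(a11_2))
    return a11_2 / np.outer(s, s), d, b_hat

rng = np.random.default_rng(0)          # arbitrary seed, illustration only
x = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
r_partial, d, b_hat = partial_correlations(x, q=2)

# Residuals of x^(1) from its regression on x^(2) and a constant, as in (11)
resid = d[:, :2] - d[:, 2:] @ b_hat.T
r_resid = np.corrcoef(resid, rowvar=False)
```

Under these assumptions `r_partial` and `r_resid` agree up to rounding error, which is the statement that partial correlations are simple correlations between residuals.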

Two geometric interpretations of the above theory can be given. In
$p$-dimensional space $x_1,\ldots,x_N$ represent $N$ points. The sample regression
function


(12) $x^{(1)} = \bar{x}^{(1)} + \hat{B}(x^{(2)} - \bar{x}^{(2)})$


is a $(p - q)$-dimensional hyperplane which is the intersection of $q$ $(p - 1)$-dimensional hyperplanes,


(13) $x_i = \bar{x}_i + \sum_{j=q+1}^{p} \hat\beta_{ij}(x_j - \bar{x}_j), \qquad i = 1,\ldots,q,$


where $x_i, x_j$ are running variables. Here $\hat\beta_{ij}$ is an element of $\hat{B} = \hat\Sigma_{12}\hat\Sigma_{22}^{-1} = A_{12}A_{22}^{-1}$. The $i$th row of $\hat{B}$ is $(\hat\beta_{i,q+1},\ldots,\hat\beta_{ip})$. Each right-hand side of (13) is
the least squares regression function of $x_i$ on $x_{q+1},\ldots,x_p$; that is, if we
project the points $x_1,\ldots,x_N$ on the coordinate hyperplane of $x_i, x_{q+1},\ldots,x_p$,
then (13) is the regression plane. The point with coordinates


(14) $x_i = \bar{x}_i + \sum_{j=q+1}^{p} \hat\beta_{ij}(x_{j\alpha} - \bar{x}_j), \quad i = 1,\ldots,q; \qquad x_j = x_{j\alpha}, \quad j = q+1,\ldots,p,$


is on the hyperplane (13). The difference in the $i$th coordinate of $x_\alpha$ and the
point (14) is $y_{i\alpha} = x_{i\alpha} - [\bar{x}_i + \sum_{j=q+1}^{p} \hat\beta_{ij}(x_{j\alpha} - \bar{x}_j)]$ for $i = 1,\ldots,q$ and 0 for
the other coordinates. Let $y_\alpha = (y_{1\alpha},\ldots,y_{q\alpha})'$. These points can be represented
as $N$ points in a $q$-dimensional space. Then $A_{11\cdot 2} = \sum_{\alpha=1}^{N} y_\alpha y_\alpha'$.


Figure 4.4

We can also interpret the sample as $p$ points in $N$-space (Figure 4.4). Let
$u_j = (x_{j1},\ldots,x_{jN})'$ be the $j$th point, and let $\varepsilon = (1,\ldots,1)'$ be another point.
The point with coordinates $\bar{x}_i,\ldots,\bar{x}_i$ is $\bar{x}_i\varepsilon$. The projection of $u_i$ on the
hyperplane spanned by $u_{q+1},\ldots,u_p, \varepsilon$ is


(15) $\hat{u}_i = \bar{x}_i\varepsilon + \sum_{j=q+1}^{p} \hat\beta_{ij}(u_j - \bar{x}_j\varepsilon);$


this is the point on the hyperplane that is at a minimum distance from $u_i$. Let
$u_i^*$ be the vector from $\hat{u}_i$ to $u_i$, that is, $u_i - \hat{u}_i$, or, equivalently, this vector
translated so that one endpoint is at the origin. The set of vectors $u_1^*,\ldots,u_q^*$
are the projections of $u_1,\ldots,u_q$ on the hyperplane orthogonal to
$u_{q+1},\ldots,u_p, \varepsilon$. Then $u_i^{*\prime}u_i^* = a_{ii\cdot q+1,\ldots,p}$, the length squared of $u_i^*$ (i.e., the
square of the distance of $u_i$ from $\hat{u}_i$). Then $u_i^{*\prime}u_j^* / \sqrt{u_i^{*\prime}u_i^*\,u_j^{*\prime}u_j^*} = r_{ij\cdot q+1,\ldots,p}$
is the cosine of the angle between $u_i^*$ and $u_j^*$.

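As an example of the use of partial correlations we consider some data
[Hooker (1907)] on yield of hay ($X_1$) in hundredweights per acre, spring
rainfall ($X_2$) in inches, and accumulated temperature above 42°F in the
spring ($X_3$) for an English area over 20 years. The estimates of $\mu_i$, $\sigma_i$
($= \sqrt{\sigma_{ii}}$), and $\rho_{ij}$ are


(16) $\hat\mu = \bar{x} = \begin{pmatrix} 28.02 \\ 4.91 \\ 594 \end{pmatrix}, \qquad \begin{pmatrix} \hat\sigma_1 \\ \hat\sigma_2 \\ \hat\sigma_3 \end{pmatrix} = \begin{pmatrix} 4.42 \\ 1.10 \\ 85 \end{pmatrix}, \qquad \begin{pmatrix} 1 & \hat\rho_{12} & \hat\rho_{13} \\ \hat\rho_{21} & 1 & \hat\rho_{23} \\ \hat\rho_{31} & \hat\rho_{32} & 1 \end{pmatrix} = \begin{pmatrix} 1.00 & 0.80 & -0.40 \\ 0.80 & 1.00 & -0.56 \\ -0.40 & -0.56 & 1.00 \end{pmatrix}.$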

From the correlations we observe that yield and rainfall are positively
related, yield and temperature are negatively related, and rainfall and
temperature are negatively related. What interpretation is to be given to the
apparent negative relation between yield and temperature? Does high
temperature tend to cause low yield, or is high temperature associated with low
rainfall and hence with low yield? To answer this question we consider the
correlation between yield and temperature when rainfall is held fixed; that is,
we use the data given above to estimate the partial correlation between $X_1$
and $X_3$ with $X_2$ held fixed. It is*


(17) $\dfrac{\hat\sigma_{13\cdot 2}}{\sqrt{\hat\sigma_{11\cdot 2}\,\hat\sigma_{33\cdot 2}}} = 0.097.$


Thus, if the effect of rainfall is removed, yield and temperature are positively
correlated. The conclusion is that both high rainfall and high temperature
increase hay yield, but in most years high rainfall occurs with low
temperature and vice versa.


*We compute with $\hat\Sigma$ as if it were $\Sigma$.
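
The value in (17) can be reproduced from the correlation matrix in (16) alone, since partial correlations are unchanged by rescaling the variables. The sketch below is an illustration in Python with NumPy, not the author's computation; the index lists `keep` and `given` are assumptions chosen to select $X_1$, $X_3$ and $X_2$.

```python
import numpy as np

# Correlation matrix of (yield, rainfall, temperature) from (16)
r = np.array([[1.00, 0.80, -0.40],
              [0.80, 1.00, -0.56],
              [-0.40, -0.56, 1.00]])

keep, given = [0, 2], [1]     # X1 and X3, holding X2 fixed (0-based indices)
r11 = r[np.ix_(keep, keep)]
r12 = r[np.ix_(keep, given)]
r22 = r[np.ix_(given, given)]

# Same algebra as (3), applied to the correlation matrix
r11_2 = r11 - r12 @ np.linalg.inv(r22) @ r12.T
r13_2 = float(r11_2[0, 1] / np.sqrt(r11_2[0, 0] * r11_2[1, 1]))
print(round(r13_2, 3))        # 0.097
```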


4.3.2. The Distribution of the Sample Partial Correlation Coefficient 


In order to test a hypothesis about a population partial correlation coefficient
we want the distribution of the sample partial correlation coefficient. The
partial correlations are computed from $A_{11\cdot 2} = A_{11} - A_{12}A_{22}^{-1}A_{21}$ (as indicated
in Theorem 4.3.1) in the same way that correlations are computed from $A$.
To obtain the distribution of a simple correlation we showed that $A$ was
distributed as $\sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha'$, where $Z_1,\ldots,Z_{N-1}$ are distributed independently
according to $N(0, \Sigma)$ and independent of $\bar{X}$ (Theorem 3.3.2). Here
we want to show that $A_{11\cdot 2}$ is distributed as $\sum_{\alpha=1}^{N-1-(p-q)} U_\alpha U_\alpha'$, where
$U_1,\ldots,U_{N-1-(p-q)}$ are distributed independently according to $N(0, \Sigma_{11\cdot 2})$
and independently of $\hat{B}$. The distribution of a partial correlation coefficient
will follow from the characterization of the distribution of $A_{11\cdot 2}$. We state the
theorem in a general form; it will be used in Chapter 8, where we treat
regression in detail. The following corollary applies it to $A_{11\cdot 2}$ expressed in
terms of residuals.


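Theorem 4.3.3. Suppose $Y_1,\ldots,Y_m$ are independent with $Y_\alpha$ distributed
according to $N(\Gamma w_\alpha, \Phi)$, where $w_\alpha$ is an $r$-component vector. Let $H = \sum_{\alpha=1}^{m} w_\alpha w_\alpha'$,
assumed nonsingular, $G = \sum_{\alpha=1}^{m} Y_\alpha w_\alpha' H^{-1}$, and


(18) $C = \sum_{\alpha=1}^{m} (Y_\alpha - Gw_\alpha)(Y_\alpha - Gw_\alpha)' = \sum_{\alpha=1}^{m} Y_\alpha Y_\alpha' - GHG'.$


Then $C$ is distributed as $\sum_{\alpha=1}^{m-r} U_\alpha U_\alpha'$, where $U_1,\ldots,U_{m-r}$ are independently
distributed according to $N(0, \Phi)$ and independently of $G$.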


Proof. The rows of $Y = (Y_1,\ldots,Y_m)$ are random vectors in an $m$-dimensional
space, and the rows of $W = (w_1,\ldots,w_m)$ are fixed vectors in that space.
The idea of the proof is to rotate coordinate axes so that the last $r$ axes are
in the space spanned by the rows of $W$. Let $E_2 = FW$, where $F$ is a square
matrix such that $FHF' = I$. Then


(19) $E_2E_2' = FWW'F' = F\sum_{\alpha=1}^{m} w_\alpha w_\alpha' F' = FHF' = I.$


Thus the $m$-component rows of $E_2$ are orthogonal and of unit length. It is
possible to find an $(m - r) \times m$ matrix $E_1$ such that


(20) $E = \begin{pmatrix} E_1 \\ E_2 \end{pmatrix}$


is orthogonal. (See Appendix, Lemma A.4.2.) Now let $U = YE'$ (i.e., $U_\alpha = \sum_{\beta=1}^{m} e_{\alpha\beta}Y_\beta$). By Theorem 3.3.1 the columns of $U = (U_1,\ldots,U_m)$ are independently
and normally distributed, each with covariance matrix $\Phi$. The means
are given by


(21) $\mathcal{E}U = \mathcal{E}YE' = \Gamma WE' = \Gamma F^{-1}E_2(E_1' \;\; E_2') = (0 \;\; \Gamma F^{-1})$


by orthogonality of $E$. To complete the proof we need to show that $C$
transforms to $\sum_{\alpha=1}^{m-r} U_\alpha U_\alpha'$. We have


(22) $\sum_{\alpha=1}^{m} Y_\alpha Y_\alpha' = YY' = UEE'U' = UU' = \sum_{\alpha=1}^{m} U_\alpha U_\alpha'.$

Note that

(23) $G = YW'H^{-1} = UEE_2'(F^{-1})'F'F = U\begin{pmatrix} 0 \\ I \end{pmatrix}F = U^{(2)}F,$


where $U^{(2)} = (U_{m-r+1},\ldots,U_m)$. Then


(24) $GHG' = U^{(2)}FHF'U^{(2)\prime} = U^{(2)}U^{(2)\prime} = \sum_{\alpha=m-r+1}^{m} U_\alpha U_\alpha'.$


Thus $C$ is


(25) $\sum_{\alpha=1}^{m} Y_\alpha Y_\alpha' - GHG' = \sum_{\alpha=1}^{m} U_\alpha U_\alpha' - \sum_{\alpha=m-r+1}^{m} U_\alpha U_\alpha' = \sum_{\alpha=1}^{m-r} U_\alpha U_\alpha'.$


This proves the theorem. ∎
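
The algebra of the proof, as distinct from its distributional content, can be verified numerically. The sketch below is an illustration under stated assumptions: Python with NumPy, arbitrary dimensions and seed, $\Phi = I$, $F$ taken as the inverse of the Cholesky factor of $H$, and $E_1$ obtained from a complete QR decomposition. None of these choices come from the text; any $F$ with $FHF' = I$ and any orthogonal completion would serve.

```python
import numpy as np

rng = np.random.default_rng(1)            # arbitrary seed
k, m, r = 2, 12, 3                        # Y is k x m, W is r x m
W = rng.normal(size=(r, m))
Gamma = rng.normal(size=(k, r))
Y = Gamma @ W + rng.normal(size=(k, m))   # columns Y_alpha ~ N(Gamma w_alpha, I)

H = W @ W.T
G = Y @ W.T @ np.linalg.inv(H)
C = Y @ Y.T - G @ H @ G.T                 # right-hand side of (18)

# F with F H F' = I: F = L^{-1}, where H = L L' is the Cholesky factorization
F = np.linalg.inv(np.linalg.cholesky(H))
E2 = F @ W                                # orthonormal rows, as in (19)

# Complete E2 to an orthogonal E as in (20): the last m - r columns of a
# full QR factor of E2' span the orthogonal complement of the rows of W.
Q, _ = np.linalg.qr(E2.T, mode="complete")
E1 = Q[:, r:].T
E = np.vstack([E1, E2])

U = Y @ E.T
C_from_U = U[:, : m - r] @ U[:, : m - r].T   # sum over the first m - r columns
```

Here `C` and `C_from_U` coincide, matching (25), and `W @ E1.T` vanishes, which is why the first $m - r$ columns of $U$ have mean zero in (21).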


It follows from the above considerations that when $\Gamma = 0$, then $\mathcal{E}U = 0$, and
we obtain the following:


Corollary 4.3.1. If $\Gamma = 0$, the matrix $GHG'$ defined in Theorem 4.3.3 is
distributed as $\sum_{\alpha=m-r+1}^{m} U_\alpha U_\alpha'$, where $U_{m-r+1},\ldots,U_m$ are independently
distributed, each according to $N(0, \Phi)$.


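We now find the distribution of $A_{11\cdot 2}$ in the same form. It was shown in
Theorem 3.3.1 that $A$ is distributed as $\sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha'$, where $Z_1,\ldots,Z_{N-1}$ are
independent, each with distribution $N(0, \Sigma)$. Let $Z_\alpha$ be partitioned into two
subvectors of $q$ and $p - q$ components, respectively:


(26) $Z_\alpha = \begin{pmatrix} Z_\alpha^{(1)} \\ Z_\alpha^{(2)} \end{pmatrix}.$

Then $A_{ij} = \sum_{\alpha=1}^{N-1} Z_\alpha^{(i)} Z_\alpha^{(j)\prime}$. By Lemma 4.2.1, conditionally on $Z_1^{(2)} = z_1^{(2)},\ldots,Z_{N-1}^{(2)} = z_{N-1}^{(2)}$, the random vectors $Z_1^{(1)},\ldots,Z_{N-1}^{(1)}$ are independently
distributed, with $Z_\alpha^{(1)}$ distributed according to $N(Bz_\alpha^{(2)}, \Sigma_{11\cdot 2})$, where $B = \Sigma_{12}\Sigma_{22}^{-1}$ and $\Sigma_{11\cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$. Now we apply Theorem 4.3.3 with
$Z_\alpha^{(1)} = Y_\alpha$, $z_\alpha^{(2)} = w_\alpha$, $N - 1 = m$, $p - q = r$, $B = \Gamma$, $\Sigma_{11\cdot 2} = \Phi$, $A_{11} = \sum_\alpha Y_\alpha Y_\alpha'$,
$A_{12}A_{22}^{-1} = G$, $A_{22} = H$. We find that the conditional distribution of $A_{11} - (A_{12}A_{22}^{-1})A_{22}(A_{22}^{-1}A_{12}') = A_{11\cdot 2}$ given $Z_\alpha^{(2)} = z_\alpha^{(2)}$, $\alpha = 1,\ldots,N-1$, is that of
$\sum_{\alpha=1}^{N-1-(p-q)} U_\alpha U_\alpha'$, where $U_1,\ldots,U_{N-1-(p-q)}$ are independent, each with
distribution $N(0, \Sigma_{11\cdot 2})$. Since this distribution does not depend on $\{z_\alpha^{(2)}\}$, we
obtain the following theorem: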


Theorem 4.3.4. The matrix $A_{11\cdot 2} = A_{11} - A_{12}A_{22}^{-1}A_{21}$ is distributed as
$\sum_{\alpha=1}^{N-1-(p-q)} U_\alpha U_\alpha'$, where $U_1,\ldots,U_{N-1-(p-q)}$ are independently distributed, each
according to $N(0, \Sigma_{11\cdot 2})$, and independently of $A_{12}$ and $A_{22}$.


Corollary 4.3.2. If $\Sigma_{12} = 0$ ($B = 0$), then $A_{11\cdot 2}$ is distributed as
$\sum_{\alpha=1}^{N-1-(p-q)} U_\alpha U_\alpha'$ and $A_{12}A_{22}^{-1}A_{21}$ is distributed as $\sum_{\alpha=N-(p-q)}^{N-1} U_\alpha U_\alpha'$, where
$U_1,\ldots,U_{N-1}$ are independently distributed, each according to $N(0, \Sigma_{11\cdot 2})$.


Now it follows that the distribution of $r_{ij\cdot q+1,\ldots,p}$ based on $N$ observations
is the same as that of a simple correlation coefficient based on $N - (p - q)$
observations with a corresponding population correlation value of $\rho_{ij\cdot q+1,\ldots,p}$.


Theorem 4.3.5. If the cdf of $r_{ij}$ based on a sample of $N$ from a normal
distribution with correlation $\rho_{ij}$ is denoted by $F(r \mid N, \rho_{ij})$, then the cdf of
the sample partial correlation $r_{ij\cdot q+1,\ldots,p}$ based on a sample of $N$ from a
normal distribution with partial correlation coefficient $\rho_{ij\cdot q+1,\ldots,p}$ is $F(r \mid N - (p - q), \rho_{ij\cdot q+1,\ldots,p})$.


This distribution was derived by Fisher (1924).


4.3.3. Tests of Hypotheses and Confidence Regions for Partial
Correlation Coefficients


Since the distribution of a sample partial correlation $r_{ij\cdot q+1,\ldots,p}$ based on a
sample of $N$ from a distribution with population correlation $\rho_{ij\cdot q+1,\ldots,p}$
equal to a certain value, $\rho$, say, is the same as the distribution of a simple
correlation $r$ based on a sample of size $N - (p - q)$ from a distribution with
the corresponding population correlation of $\rho$, all statistical inference
procedures for the simple population correlation can be used for the partial
correlation. The procedure for the partial correlation is exactly the same
except that $N$ is replaced by $N - (p - q)$. To illustrate this rule we give two
examples.


Example 1. Suppose that on the basis of a sample of size $N$ we wish to
obtain a confidence interval for $\rho_{ij\cdot q+1,\ldots,p}$. The sample partial correlation is
$r_{ij\cdot q+1,\ldots,p}$. The procedure is to use David's charts for $N - (p - q)$. In the
example at the end of Section 4.3.1, we might want to find a confidence
interval for $\rho_{12\cdot 3}$ with confidence coefficient 0.95. The sample partial
correlation is $r_{12\cdot 3} = 0.759$. We use the chart (or table) for $N - (p - q) = 20 - 1 = 19$.
The interval is $0.50 < \rho_{12\cdot 3} < 0.88$.


Example 2. Suppose that on the basis of a sample of size $N$ we use
Fisher's $z$ for an approximate significance test of $\rho_{ij\cdot q+1,\ldots,p} = \rho_0$ against
two-sided alternatives. We let


(27) $z = \dfrac{1}{2}\log\dfrac{1 + r_{ij\cdot q+1,\ldots,p}}{1 - r_{ij\cdot q+1,\ldots,p}}, \qquad \zeta_0 = \dfrac{1}{2}\log\dfrac{1 + \rho_0}{1 - \rho_0}.$


Then $\sqrt{N - (p - q) - 3}\,(z - \zeta_0)$ is compared with the significance points of
the standardized normal distribution. In the example at the end of Section
4.3.1, we might wish to test the hypothesis $\rho_{13\cdot 2} = 0$ at the 0.05 level. Then
$\zeta_0 = 0$ and $\sqrt{20 - 1 - 3}\,(0.0973) = 0.3892$. This value is clearly nonsignificant
($|0.3892| < 1.96$), and hence the data do not indicate rejection of the null
hypothesis.
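
The arithmetic of Example 2 can be sketched as follows. This is a hedged illustration in Python, not part of the original text; the function name `fisher_z_statistic` and its parameter names are hypothetical, with `n_fixed` standing for $p - q$, the number of variables held fixed.

```python
import math

def fisher_z_statistic(r_partial, n_obs, n_fixed, rho0=0.0):
    """sqrt(N - (p - q) - 3) * (z - zeta0) for a sample partial correlation."""
    z = math.atanh(r_partial)        # (1/2) log[(1 + r) / (1 - r)]
    zeta0 = math.atanh(rho0)
    return math.sqrt(n_obs - n_fixed - 3) * (z - zeta0)

# Hay data: r_{13.2} = 0.097, N = 20, one variable (rainfall) held fixed
stat = fisher_z_statistic(0.097, n_obs=20, n_fixed=1)
reject = abs(stat) > 1.96            # two-sided test at the 0.05 level
```

With these inputs `stat` is about 0.389 and `reject` is false, in agreement with the example.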

To answer the question whether two variables $x_1$ and $x_2$ are related when
both may be related to a vector $x^{(2)} = (x_3,\ldots,x_p)$, two approaches may be
used. One is to consider the regression of $x_1$ on $x_2$ and $x^{(2)}$ and test whether
the regression of $x_1$ on $x_2$ is 0. Another is to test whether $\rho_{12\cdot 3,\ldots,p} = 0$.
Problems 4.43-4.47 show that these approaches lead to exactly the same test.


4.4. THE MULTIPLE CORRELATION COEFFICIENT 


4.4.1. Estimation of the Multiple Correlation Coefficient 


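The population multiple correlation between one variate and a set of variates
was defined in Section 2.5. For the sake of convenience in this section we
shall treat the case of the multiple correlation between $X_1$ and the vector
$X^{(2)} = (X_2,\ldots,X_p)'$; we shall not need subscripts on $\bar{R}$. The variables can
always be numbered so that the desired multiple correlation is this one (any
irrelevant variables being omitted). Then the multiple correlation in the
population is


(1) $\bar{R} = \dfrac{\beta'\sigma_{(1)}}{\sqrt{\sigma_{11}\,\beta'\Sigma_{22}\beta}} = \sqrt{\dfrac{\beta'\Sigma_{22}\beta}{\sigma_{11}}} = \sqrt{\dfrac{\sigma_{(1)}'\Sigma_{22}^{-1}\sigma_{(1)}}{\sigma_{11}}},$


where $\beta$, $\sigma_{(1)}$, and $\Sigma_{22}$ are defined by


(2) $\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{(1)}' \\ \sigma_{(1)} & \Sigma_{22} \end{pmatrix},$


(3) $\beta = \Sigma_{22}^{-1}\sigma_{(1)}.$


Given a sample $x_1,\ldots,x_N$ ($N > p$), we estimate $\Sigma$ by $S = [N/(N - 1)]\hat\Sigma$ or


(4) $\hat\Sigma = \dfrac{1}{N}A = \dfrac{1}{N}\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' = \begin{pmatrix} \hat\sigma_{11} & \hat\sigma_{(1)}' \\ \hat\sigma_{(1)} & \hat\Sigma_{22} \end{pmatrix},$


and we estimate $\beta$ by $\hat\beta = \hat\Sigma_{22}^{-1}\hat\sigma_{(1)} = A_{22}^{-1}a_{(1)}$. We define the sample multiple
correlation coefficient by


(5) $R = \sqrt{\dfrac{\hat\beta'\hat\Sigma_{22}\hat\beta}{\hat\sigma_{11}}} = \sqrt{\dfrac{\hat\sigma_{(1)}'\hat\Sigma_{22}^{-1}\hat\sigma_{(1)}}{\hat\sigma_{11}}} = \sqrt{\dfrac{a_{(1)}'A_{22}^{-1}a_{(1)}}{a_{11}}}.$


That this is the maximum likelihood estimator of $\bar{R}$ is justified by Corollary
3.2.1, since we can define $\bar{R}$, $\sigma_{(1)}$, $\Sigma_{22}$ as a one-to-one transformation of $\Sigma$.
Another expression for $R$ [see (16) of Section 2.5] follows from


(6) $1 - R^2 = \dfrac{|\hat\Sigma|}{\hat\sigma_{11}|\hat\Sigma_{22}|} = \dfrac{|A|}{a_{11}|A_{22}|}.$

The quantities $R$ and $\hat\beta$ have properties in the sample that are similar to
those $\bar{R}$ and $\beta$ have in the population. We have analogs of Theorems 2.5.2,
2.5.3, and 2.5.4. Let $\hat{x}_{1\alpha} = \bar{x}_1 + \hat\beta'(x_\alpha^{(2)} - \bar{x}^{(2)})$, and $x_{1\alpha}^* = x_{1\alpha} - \hat{x}_{1\alpha}$ be the
residual.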


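Theorem 4.4.1. The residuals $x_{1\alpha}^*$ are uncorrelated in the sample with the
components of $x_\alpha^{(2)}$, $\alpha = 1,\ldots,N$. For every vector $a$


(7) $\sum_{\alpha=1}^{N} \left[ (x_{1\alpha} - \bar{x}_1) - \hat\beta'(x_\alpha^{(2)} - \bar{x}^{(2)}) \right]^2 \le \sum_{\alpha=1}^{N} \left[ (x_{1\alpha} - \bar{x}_1) - a'(x_\alpha^{(2)} - \bar{x}^{(2)}) \right]^2.$


The sample correlation between $x_{1\alpha}$ and $a'x_\alpha^{(2)}$, $\alpha = 1,\ldots,N$, is maximized for
$a = \hat\beta$, and that maximum correlation is $R$.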


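Proof. Since the sample mean of the residuals is 0, the vector of sample
covariances between $x_{1\alpha}^*$ and $x_\alpha^{(2)}$ is proportional to


(8) $\sum_{\alpha=1}^{N} \left[ (x_{1\alpha} - \bar{x}_1) - \hat\beta'(x_\alpha^{(2)} - \bar{x}^{(2)}) \right](x_\alpha^{(2)} - \bar{x}^{(2)})' = a_{(1)}' - \hat\beta'A_{22} = 0.$


The right-hand side of (7) can be written as the left-hand side plus


(9) $\sum_{\alpha=1}^{N} \left[ (\hat\beta - a)'(x_\alpha^{(2)} - \bar{x}^{(2)}) \right]^2 = (\hat\beta - a)'\sum_{\alpha=1}^{N} (x_\alpha^{(2)} - \bar{x}^{(2)})(x_\alpha^{(2)} - \bar{x}^{(2)})'(\hat\beta - a),$


which is 0 if and only if $a = \hat\beta$. To prove the third assertion we consider the
vector $a$ for which $\sum_{\alpha=1}^{N} [a'(x_\alpha^{(2)} - \bar{x}^{(2)})]^2 = \sum_{\alpha=1}^{N} [\hat\beta'(x_\alpha^{(2)} - \bar{x}^{(2)})]^2$, since the
correlation is unchanged when the linear function is multiplied by a positive
constant. From (7) we obtain


(10) $a_{11} - 2\sum_{\alpha=1}^{N} (x_{1\alpha} - \bar{x}_1)\hat\beta'(x_\alpha^{(2)} - \bar{x}^{(2)}) + \sum_{\alpha=1}^{N} \left[ \hat\beta'(x_\alpha^{(2)} - \bar{x}^{(2)}) \right]^2$
$\qquad \le a_{11} - 2\sum_{\alpha=1}^{N} (x_{1\alpha} - \bar{x}_1)a'(x_\alpha^{(2)} - \bar{x}^{(2)}) + \sum_{\alpha=1}^{N} \left[ a'(x_\alpha^{(2)} - \bar{x}^{(2)}) \right]^2,$


from which we deduce


(11) $\dfrac{\sum_{\alpha=1}^{N} (x_{1\alpha} - \bar{x}_1)(x_\alpha^{(2)} - \bar{x}^{(2)})'a}{\sqrt{a_{11}\sum_{\alpha=1}^{N} [a'(x_\alpha^{(2)} - \bar{x}^{(2)})]^2}} \le \dfrac{\sum_{\alpha=1}^{N} (x_{1\alpha} - \bar{x}_1)(x_\alpha^{(2)} - \bar{x}^{(2)})'\hat\beta}{\sqrt{a_{11}\sum_{\alpha=1}^{N} [\hat\beta'(x_\alpha^{(2)} - \bar{x}^{(2)})]^2}} = \dfrac{a_{(1)}'\hat\beta}{\sqrt{a_{11}}\sqrt{\hat\beta'A_{22}\hat\beta}},$


which is (5). ∎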


Thus $\bar{x}_1 + \hat\beta'(x_\alpha^{(2)} - \bar{x}^{(2)})$ is the best linear predictor of $x_{1\alpha}$ in the sample,
and $\hat\beta'x_\alpha^{(2)}$ is the linear function of $x_\alpha^{(2)}$ that has maximum sample correlation
with $x_{1\alpha}$. The minimum sum of squares of deviations [the left-hand side of
(7)] is


(12) $\sum_{\alpha=1}^{N} \left[ (x_{1\alpha} - \bar{x}_1) - \hat\beta'(x_\alpha^{(2)} - \bar{x}^{(2)}) \right]^2 = a_{11} - \hat\beta'A_{22}\hat\beta = a_{11} - a_{(1)}'A_{22}^{-1}a_{(1)} = a_{11\cdot 2},$


as defined in Section 4.3 with $q = 1$. The maximum likelihood estimator of
$\sigma_{11\cdot 2}$ is $\hat\sigma_{11\cdot 2} = a_{11\cdot 2}/N$. It follows that


(13) $\hat\sigma_{11\cdot 2} = (1 - R^2)\hat\sigma_{11}.$


Thus $1 - R^2$ measures the proportional reduction in the variance by using
residuals. We can say that $R^2$ is the fraction of the variance explained by $x^{(2)}$.
The larger $R^2$ is, the more the variance is decreased by use of the
explanatory variables in $x^{(2)}$.
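
The relations (5), (6), (12), and (13), together with the maximizing property of Theorem 4.4.1, can be checked on simulated data. The sketch below assumes Python with NumPy; the sample size, dimension, seed, and the comparison vector of ones are arbitrary illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(2)                  # arbitrary seed
N, p = 40, 4
x = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))

d = x - x.mean(axis=0)
A = d.T @ d
a11, a1, A22 = A[0, 0], A[1:, 0], A[1:, 1:]
beta_hat = np.linalg.solve(A22, a1)             # A22^{-1} a_(1)

R = np.sqrt(a1 @ beta_hat / a11)                # definition (5)
fitted = d[:, 1:] @ beta_hat                    # beta-hat'(x^(2) - xbar^(2))
R_as_corr = np.corrcoef(d[:, 0], fitted)[0, 1]  # correlation with best predictor
one_minus_R2 = np.linalg.det(A) / (a11 * np.linalg.det(A22))   # (6)
a11_2 = np.sum((d[:, 0] - fitted) ** 2)         # residual sum of squares, (12)

# Any other linear combination has sample correlation at most R
other = np.corrcoef(d[:, 0], d[:, 1:] @ np.ones(p - 1))[0, 1]
```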

In $p$-dimensional space $x_1,\ldots,x_N$ represent $N$ points. The sample regression
function $x_1 = \bar{x}_1 + \hat\beta'(x^{(2)} - \bar{x}^{(2)})$ is the $(p - 1)$-dimensional hyperplane
that minimizes the squared deviations of the points from the hyperplane, the
deviations being calculated in the $x_1$-direction. The hyperplane goes through
the point $\bar{x}$.

In $N$-dimensional space the rows of $(x_1,\ldots,x_N)$ represent $p$ points. The
$N$-component vector with $\alpha$th component $x_{i\alpha} - \bar{x}_i$ is the projection of the
vector with $\alpha$th component $x_{i\alpha}$ on the plane orthogonal to the equiangular
line. We have $p$ such vectors; $a'(x_\alpha^{(2)} - \bar{x}^{(2)})$ is the $\alpha$th component of a vector
in the hyperplane spanned by the last $p - 1$ vectors. Since the right-hand side
of (7) is the squared distance between the first vector and the linear
combination of the last $p - 1$ vectors, $\hat\beta'(x_\alpha^{(2)} - \bar{x}^{(2)})$ is a component of the
vector which minimizes this squared distance. The interpretation of (8) is that
the vector with $\alpha$th component $(x_{1\alpha} - \bar{x}_1) - \hat\beta'(x_\alpha^{(2)} - \bar{x}^{(2)})$ is orthogonal to
each of the last $p - 1$ vectors. Thus the vector with $\alpha$th component $\hat\beta'(x_\alpha^{(2)} - \bar{x}^{(2)})$ is the projection of the first vector on the hyperplane. See Figure 4.5.
The length squared of the projection vector is


(14) $\sum_{\alpha=1}^{N} \left[ \hat\beta'(x_\alpha^{(2)} - \bar{x}^{(2)}) \right]^2 = \hat\beta'A_{22}\hat\beta = a_{(1)}'A_{22}^{-1}a_{(1)},$


and the length squared of the first vector is $\sum_{\alpha=1}^{N} (x_{1\alpha} - \bar{x}_1)^2 = a_{11}$. Thus $R$ is
the cosine of the angle between the first vector and its projection.


Figure 4.5


In Section 3.2 we saw that the simple correlation coefficient is the cosine
of the angle between the two vectors involved (in the plane orthogonal to the
equiangular line). The property of $R$ that it is the maximum correlation
between $x_{1\alpha}$ and linear combinations of the components of $x_\alpha^{(2)}$ corresponds
to the geometric property that $R$ is the cosine of the smallest angle between
the vector with components $x_{1\alpha} - \bar{x}_1$ and a vector in the hyperplane spanned
by the other $p - 1$ vectors.

The geometric interpretations are in terms of the vectors in the $(N - 1)$-dimensional
hyperplane orthogonal to the equiangular line. It was shown in
Section 3.3 that the vector $(x_{i1} - \bar{x}_i,\ldots,x_{iN} - \bar{x}_i)$ in this hyperplane can be
designated as $(z_{i1},\ldots,z_{i,N-1})$, where the $z_{i\alpha}$ are the coordinates referred to
an $(N - 1)$-dimensional coordinate system in the hyperplane. It was shown
that the new coordinates are obtained from the old by the transformation
$z_{i\alpha} = \sum_{\beta=1}^{N} b_{\alpha\beta}x_{i\beta}$, $\alpha = 1,\ldots,N$, where $B = (b_{\alpha\beta})$ is an orthogonal matrix
with last row $(1/\sqrt{N},\ldots,1/\sqrt{N})$. Then


(15) $a_{ij} = \sum_{\alpha=1}^{N} (x_{i\alpha} - \bar{x}_i)(x_{j\alpha} - \bar{x}_j) = \sum_{\alpha=1}^{N-1} z_{i\alpha}z_{j\alpha}.$


It will be convenient to refer to the multiple correlation defined in terms of
$z_{i\alpha}$ as the multiple correlation without subtracting the means.

The population multiple correlation $\bar{R}$ is essentially the only function of
the parameters $\mu$ and $\Sigma$ that is invariant under changes of location, changes
of scale of $X_1$, and nonsingular linear transformations of $X^{(2)}$, that is,
transformations $X_1^* = cX_1 + d$, $X^{(2)*} = CX^{(2)} + d$. Similarly, the sample
multiple correlation coefficient $R$ is essentially the only function of $\bar{x}$ and $\hat\Sigma$, the
sufficient set of statistics for $\mu$ and $\Sigma$, that is invariant under these
transformations. Just as the simple correlation $r$ is a measure of association between
two scalar variables in a sample, the multiple correlation $R$ is a measure of
association between a scalar variable and a vector variable in a sample.


4.4.2. Distribution of the Sample Multiple Correlation Coefficient 
When the Population Multiple Correlation Coefficient Is Zero 


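From (5) we have


(16) $R^2 = \dfrac{a_{(1)}'A_{22}^{-1}a_{(1)}}{a_{11}};$

then

(17) $1 - R^2 = 1 - \dfrac{a_{(1)}'A_{22}^{-1}a_{(1)}}{a_{11}} = \dfrac{a_{11} - a_{(1)}'A_{22}^{-1}a_{(1)}}{a_{11}} = \dfrac{a_{11\cdot 2}}{a_{11}},$

and

(18) $\dfrac{R^2}{1 - R^2} = \dfrac{a_{(1)}'A_{22}^{-1}a_{(1)}}{a_{11\cdot 2}}.$


For $q = 1$, Corollary 4.3.2 states that when $\beta = 0$, that is, when $\bar{R} = 0$, $a_{11\cdot 2}$ is
distributed as $\sum_{\alpha=1}^{N-p} V_\alpha^2$ and $a_{(1)}'A_{22}^{-1}a_{(1)}$ is distributed as $\sum_{\alpha=N-p+1}^{N-1} V_\alpha^2$, where
$V_1,\ldots,V_{N-1}$ are independent, each with distribution $N(0, \sigma_{11\cdot 2})$. Then
$a_{11\cdot 2}/\sigma_{11\cdot 2}$ and $a_{(1)}'A_{22}^{-1}a_{(1)}/\sigma_{11\cdot 2}$ are distributed independently as $\chi^2$-variables
with $N - p$ and $p - 1$ degrees of freedom, respectively. Thus


(19) $F = \dfrac{R^2}{1 - R^2}\cdot\dfrac{N - p}{p - 1} = \dfrac{a_{(1)}'A_{22}^{-1}a_{(1)}/\sigma_{11\cdot 2}}{a_{11\cdot 2}/\sigma_{11\cdot 2}}\cdot\dfrac{N - p}{p - 1} = \dfrac{\chi^2_{p-1}}{\chi^2_{N-p}}\cdot\dfrac{N - p}{p - 1}$


has the $F$-distribution with $p - 1$ and $N - p$ degrees of freedom. The density
of $F$ is


(20) $\dfrac{\Gamma[\frac{1}{2}(N - 1)]}{\Gamma[\frac{1}{2}(p - 1)]\,\Gamma[\frac{1}{2}(N - p)]} \left( \dfrac{p - 1}{N - p} \right)^{\frac{1}{2}(p-1)} f^{\frac{1}{2}(p-1)-1} \left( 1 + \dfrac{p - 1}{N - p}f \right)^{-\frac{1}{2}(N-1)}.$


Thus the density of


(21) $R = \sqrt{\dfrac{\frac{p-1}{N-p}F}{1 + \frac{p-1}{N-p}F}}$

is

(22) $2\,\dfrac{\Gamma[\frac{1}{2}(N - 1)]}{\Gamma[\frac{1}{2}(p - 1)]\,\Gamma[\frac{1}{2}(N - p)]}\,R^{p-2}(1 - R^2)^{\frac{1}{2}(N-p)-1}, \qquad 0 \le R \le 1.$


Theorem 4.4.2. Let $R$ be the sample multiple correlation coefficient [defined
by (5)] between $X_1$ and $X^{(2)\prime} = (X_2,\ldots,X_p)$ based on a sample of $N$ from
$N(\mu, \Sigma)$. If $\bar{R} = 0$ [that is, if $(\sigma_{12},\ldots,\sigma_{1p})' = 0 = \beta$], then $[R^2/(1 - R^2)]\cdot[(N - p)/(p - 1)]$ is distributed as $F$ with $p - 1$ and $N - p$ degrees of freedom.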
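
A small simulation gives an informal check of Theorem 4.4.2. The sketch below assumes Python with NumPy; the values $N = 20$, $p = 3$, the number of replications, and the seed are arbitrary. The comparison uses two standard facts about the $F$-distribution with 2 and 17 degrees of freedom: its mean is $17/15$, and its tabulated upper 5% point is about 3.59.

```python
import numpy as np

rng = np.random.default_rng(3)             # arbitrary seed
N, p, reps = 20, 3, 20000
f_values = np.empty(reps)
for i in range(reps):
    x = rng.normal(size=(N, p))            # independent columns: population R is 0
    d = x - x.mean(axis=0)
    A = d.T @ d
    r2 = A[0, 1:] @ np.linalg.solve(A[1:, 1:], A[1:, 0]) / A[0, 0]   # (16)
    f_values[i] = r2 / (1 - r2) * (N - p) / (p - 1)                  # (19)

empirical_mean = f_values.mean()
expected_mean = (N - p) / (N - p - 2)      # mean of F with p-1 and N-p d.f.
tail_fraction = np.mean(f_values > 3.59)   # should be near 0.05
```

Agreement of `empirical_mean` with `expected_mean` and of `tail_fraction` with 0.05 is only up to Monte Carlo error; the simulation illustrates the theorem and does not prove it.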


It should be noticed that $p - 1$ is the number of components of $X^{(2)}$ and
that $N - p = N - (p - 1) - 1$. If the multiple correlation is between a component
$X_i$ and $q$ other components, the numbers are $q$ and $N - q - 1$.

It might be observed that $R^2/(1 - R^2)$ is the quantity that arises in
regression (or least squares) theory for testing the hypothesis that the
regression of $X_1$ on $X_2,\ldots,X_p$ is zero.

If $\bar{R} \ne 0$, the distribution of $R$ is much more difficult to derive. This
distribution will be obtained in Section 4.4.3.

Now let us consider the statistical problem of testing the hypothesis
$H\colon \bar{R} = 0$ on the basis of a sample of $N$ from $N(\mu, \Sigma)$. [$\bar{R}$ is the population
multiple correlation between $X_1$ and $(X_2,\ldots,X_p)$.] Since $\bar{R} \ge 0$, the
alternatives considered are $\bar{R} > 0$.

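Let us derive the likelihood ratio test of this hypothesis. The likelihood
function is


(23) $L(\mu^*, \Sigma^*) = \dfrac{1}{(2\pi)^{\frac{1}{2}pN}|\Sigma^*|^{\frac{1}{2}N}} \exp\left[ -\dfrac{1}{2}\sum_{\alpha=1}^{N} (x_\alpha - \mu^*)'\Sigma^{*-1}(x_\alpha - \mu^*) \right].$


The observations are given; $L$ is a function of the indeterminates $\mu^*$, $\Sigma^*$. Let
$\omega$ be the region in the parameter space $\Omega$ specified by the null hypothesis.
The likelihood ratio criterion is


(24) $\lambda = \dfrac{\max_{\mu^*, \Sigma^* \in \omega} L(\mu^*, \Sigma^*)}{\max_{\mu^*, \Sigma^* \in \Omega} L(\mu^*, \Sigma^*)}.$


Here $\Omega$ is the space of $\mu^*$, $\Sigma^*$ positive definite, and $\omega$ is the region in this
space where $\bar{R} = \sqrt{\sigma_{(1)}'\Sigma_{22}^{-1}\sigma_{(1)}}/\sqrt{\sigma_{11}} = 0$, that is, where $\sigma_{(1)}'\Sigma_{22}^{-1}\sigma_{(1)} = 0$.
Because $\Sigma_{22}^{-1}$ is positive definite, this condition is equivalent to $\sigma_{(1)} = 0$. The
maximum of $L(\mu^*, \Sigma^*)$ over $\Omega$ occurs at $\mu^* = \hat\mu = \bar{x}$ and $\Sigma^* = \hat\Sigma = (1/N)A = (1/N)\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'$ and is


(25) $\max_{\mu^*, \Sigma^* \in \Omega} L(\mu^*, \Sigma^*) = \dfrac{N^{\frac{1}{2}pN}e^{-\frac{1}{2}pN}}{(2\pi)^{\frac{1}{2}pN}|A|^{\frac{1}{2}N}}.$


In $\omega$ the likelihood function is

(26) $L(\mu^*, \Sigma^* \mid \sigma_{(1)}^* = 0) = \dfrac{1}{(2\pi)^{\frac{1}{2}N}\sigma_{11}^{*\frac{1}{2}N}} \exp\left[ -\dfrac{1}{2}\sum_{\alpha=1}^{N} (x_{1\alpha} - \mu_1^*)^2/\sigma_{11}^* \right]$
$\qquad \times \dfrac{1}{(2\pi)^{\frac{1}{2}(p-1)N}|\Sigma_{22}^*|^{\frac{1}{2}N}} \exp\left[ -\dfrac{1}{2}\sum_{\alpha=1}^{N} (x_\alpha^{(2)} - \mu^{(2)*})'\Sigma_{22}^{*-1}(x_\alpha^{(2)} - \mu^{(2)*}) \right].$


The first factor is maximized at $\mu_1^* = \hat\mu_1 = \bar{x}_1$ and $\sigma_{11}^* = \hat\sigma_{11} = (1/N)a_{11}$, and
the second factor is maximized at $\mu^{(2)*} = \hat\mu^{(2)} = \bar{x}^{(2)}$ and $\Sigma_{22}^* = \hat\Sigma_{22} = (1/N)A_{22}$. The value of the maximized function is


(27) $\max_{\mu^*, \Sigma^* \in \omega} L(\mu^*, \Sigma^*) = \dfrac{N^{\frac{1}{2}N}e^{-\frac{1}{2}N}}{(2\pi)^{\frac{1}{2}N}a_{11}^{\frac{1}{2}N}} \cdot \dfrac{N^{\frac{1}{2}(p-1)N}e^{-\frac{1}{2}(p-1)N}}{(2\pi)^{\frac{1}{2}(p-1)N}|A_{22}|^{\frac{1}{2}N}}.$


Thus the likelihood ratio criterion is [see (6)]


(28) $\lambda = \dfrac{|A|^{\frac{1}{2}N}}{a_{11}^{\frac{1}{2}N}|A_{22}|^{\frac{1}{2}N}} = (1 - R^2)^{\frac{1}{2}N}.$


The likelihood ratio test consists of the critical region $\lambda < \lambda_0$, where $\lambda_0$ is
chosen so the probability of this inequality when $\bar{R} = 0$ is the significance
level $\alpha$. An equivalent test is


(29) $1 - \lambda^{2/N} = R^2 > 1 - \lambda_0^{2/N}.$


Since $[R^2/(1 - R^2)][(N - p)/(p - 1)]$ is a monotonic function of $R$, an
equivalent test involves this ratio being larger than a constant. When $\bar{R} = 0$,
this ratio has an $F_{p-1,N-p}$-distribution. Hence, the critical region is


(30) $\dfrac{R^2}{1 - R^2}\cdot\dfrac{N - p}{p - 1} > F_{p-1,N-p}(\alpha),$


where $F_{p-1,N-p}(\alpha)$ is the (upper) significance point corresponding to the $\alpha$
significance level.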


Theorem 4.4.3. Given a sample $x_1,\ldots,x_N$ from $N(\mu, \Sigma)$, the likelihood
ratio test at significance level $\alpha$ for the hypothesis $\bar{R} = 0$, where $\bar{R}$ is the
population multiple correlation coefficient between $X_1$ and $(X_2,\ldots,X_p)$, is given
by (30), where $R$ is the sample multiple correlation coefficient defined by (5).


As an example consider the data given at the end of Section 4.3.1. The
sample multiple correlation coefficient is found from


(31) $1 - R^2 = \dfrac{\begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix}}{\begin{vmatrix} 1 & r_{23} \\ r_{32} & 1 \end{vmatrix}} = \dfrac{\begin{vmatrix} 1.00 & 0.80 & -0.40 \\ 0.80 & 1.00 & -0.56 \\ -0.40 & -0.56 & 1.00 \end{vmatrix}}{\begin{vmatrix} 1.00 & -0.56 \\ -0.56 & 1.00 \end{vmatrix}} = 0.357.$


Thus $R$ is 0.802. If we wish to test the hypothesis at the 0.01 level that hay
yield is independent of spring rainfall and temperature, we compare the
observed $[R^2/(1 - R^2)][(20 - 3)/(3 - 1)] = 15.3$ with $F_{2,17}(0.01) = 6.11$ and
find the result significant; that is, we reject the null hypothesis.
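
The numbers in this example follow directly from the correlation matrix. The short sketch below, in Python with NumPy, is offered only as a check of the arithmetic; the critical value 6.11 is the one quoted in the text and is not computed here.

```python
import numpy as np

# Correlation matrix of (yield, rainfall, temperature) from Section 4.3.1
r = np.array([[1.00, 0.80, -0.40],
              [0.80, 1.00, -0.56],
              [-0.40, -0.56, 1.00]])
N, p = 20, 3

# (31): ratio of determinants, the correlation-matrix form of (6)
one_minus_R2 = np.linalg.det(r) / np.linalg.det(r[1:, 1:])
R = np.sqrt(1.0 - one_minus_R2)

# Statistic in (30)
F = (1.0 - one_minus_R2) / one_minus_R2 * (N - p) / (p - 1)
significant = F > 6.11        # F_{2,17}(0.01) as quoted in the text
```

This gives `one_minus_R2` of about 0.357, `R` of about 0.802, and `F` of about 15.3.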

The test of independence between $X_1$ and $(X_2,\ldots,X_p) = X^{(2)\prime}$ is equivalent
to the test that if the regression of $X_1$ on $x^{(2)}$ (that is, the conditional
expected value of $X_1$ given $X_2 = x_2,\ldots,X_p = x_p$) is $\mu_1 + \beta'(x^{(2)} - \mu^{(2)})$, the
vector of regression coefficients is 0. Here $\hat\beta = A_{22}^{-1}a_{(1)}$ is the usual least
squares estimate of $\beta$ with expected value $\beta$ and covariance matrix $\sigma_{11\cdot 2}A_{22}^{-1}$
(when the $X_\alpha^{(2)}$ are fixed), and $a_{11\cdot 2}/(N - p)$ is the usual estimate of $\sigma_{11\cdot 2}$.
Thus [see (18)]


(32) $\dfrac{R^2}{1 - R^2}\cdot\dfrac{N - p}{p - 1} = \dfrac{\hat\beta'A_{22}\hat\beta/(p - 1)}{a_{11\cdot 2}/(N - p)}$


is the usual $F$-statistic for testing the hypothesis that the regression of $X_1$ on
$x_2,\ldots,x_p$ is 0. In this book we are primarily interested in the multiple
correlation coefficient as a measure of association between one variable and
a vector of variables when both are random. We shall not treat problems of
univariate regression. In Chapter 8 we study regression when the dependent
variable is a vector.


Adjusted Multiple Correlation Coefficient 

The expression (17) is the ratio of $a_{11\cdot 2}$, the sum of squared deviations from
the fitted regression, to $a_{11}$, the sum of squared deviations around the mean.
To obtain unbiased estimators of $\sigma_{11}$ when $\beta = 0$ we would divide these
quantities by their numbers of degrees of freedom, $N - p$ and $N - 1$,
respectively. Accordingly we can define an adjusted multiple correlation
coefficient $R^*$ by


(33) $1 - R^{*2} = \dfrac{a_{11\cdot 2}/(N - p)}{a_{11}/(N - 1)} = \dfrac{N - 1}{N - p}(1 - R^2),$


which is equivalent to

(34) $R^{*2} = R^2 - \dfrac{p - 1}{N - p}(1 - R^2).$


This quantity is smaller than $R^2$ (unless $p = 1$ or $R^2 = 1$). A possible merit to
it is that it takes account of $p$; the idea is that the larger $p$ is relative to $N$,
the greater the tendency of $R^2$ to be large by chance.
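
As a worked illustration, which is not in the original text, (34) can be applied to the hay data of Section 4.3.1, where $N = 20$, $p = 3$, and $1 - R^2 = 0.357$ by (31). The figures below are rounded to three decimals.

```latex
\begin{align*}
R^{*2} &= R^2 - \frac{p-1}{N-p}\,(1 - R^2)
        = 0.643 - \frac{2}{17}\,(0.357) \\
       &\approx 0.643 - 0.042 = 0.601,
\qquad R^{*} \approx 0.775 .
\end{align*}
```

The adjustment lowers $R$ from 0.802 to about 0.775, a modest change because $p$ is small relative to $N$ here.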


4.4.3. Distribution of the Sample Multiple Correlation Coefficient When the 
Population Multiple Correlation Coefficient Is Not Zero 


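In this subsection we shall find the distribution of $R$ when the null hypothesis
$\bar{R} = 0$ is not true. We shall find that the distribution depends only on the
population multiple correlation coefficient $\bar{R}$.

First let us consider the conditional distribution of $R^2/(1 - R^2) = a_{(1)}'A_{22}^{-1}a_{(1)}/a_{11\cdot 2}$ given $Z_\alpha^{(2)} = z_\alpha^{(2)}$, $\alpha = 1,\ldots,n$ (here $n = N - 1$). Under these conditions
$Z_{11},\ldots,Z_{1n}$ are independently distributed, $Z_{1\alpha}$ according to $N(\beta'z_\alpha^{(2)}, \sigma_{11\cdot 2})$,
where $\beta = \Sigma_{22}^{-1}\sigma_{(1)}$ and $\sigma_{11\cdot 2} = \sigma_{11} - \sigma_{(1)}'\Sigma_{22}^{-1}\sigma_{(1)}$. The conditions are those
of Theorem 4.3.3 with $Y_\alpha = Z_{1\alpha}$, $\Gamma = \beta'$, $w_\alpha = z_\alpha^{(2)}$, $r = p - 1$, $\Phi = \sigma_{11\cdot 2}$,
$m = n$. Then $a_{11\cdot 2} = a_{11} - a_{(1)}'A_{22}^{-1}a_{(1)}$ corresponds to $\sum_{\alpha=1}^{m} Y_\alpha Y_\alpha' - GHG'$, and
$a_{11\cdot 2}/\sigma_{11\cdot 2}$ has a $\chi^2$-distribution with $n - (p - 1)$ degrees of freedom.
$a_{(1)}'A_{22}^{-1}a_{(1)} = (A_{22}^{-1}a_{(1)})'A_{22}(A_{22}^{-1}a_{(1)})$ corresponds to $GHG'$ and is distributed
as $\sum_\alpha U_\alpha^2$, $\alpha = n - (p - 1) + 1,\ldots,n$, where $\mathrm{Var}(U_\alpha) = \sigma_{11\cdot 2}$ and

(35) $\mathcal{E}(U_{n-p+2},\ldots,U_n) = \Gamma F^{-1},$


where $FHF' = I$ [$H = F^{-1}(F')^{-1}$]. Then $a_{(1)}'A_{22}^{-1}a_{(1)}/\sigma_{11\cdot 2}$ is distributed as
$\sum_\alpha (U_\alpha/\sqrt{\sigma_{11\cdot 2}})^2$, where $\mathrm{Var}(U_\alpha/\sqrt{\sigma_{11\cdot 2}}) = 1$ and


(36) $\sum_{\alpha=n-p+2}^{n} \left( \dfrac{\mathcal{E}U_\alpha}{\sqrt{\sigma_{11\cdot 2}}} \right)^2 = \dfrac{\Gamma F^{-1}(F^{-1})'\Gamma'}{\sigma_{11\cdot 2}} = \dfrac{\Gamma H\Gamma'}{\sigma_{11\cdot 2}} = \dfrac{\beta'A_{22}\beta}{\sigma_{11\cdot 2}}.$


Thus (conditionally) $a_{(1)}'A_{22}^{-1}a_{(1)}/\sigma_{11\cdot 2}$ has a noncentral $\chi^2$-distribution with
$p - 1$ degrees of freedom and noncentrality parameter $\beta'A_{22}\beta/\sigma_{11\cdot 2}$. (See
Theorem 5.4.1.) We are led to the following theorem:


Theorem 4.4.4. Let $R$ be the sample multiple correlation coefficient between
$X_1$ and $X^{(2)\prime} = (X_2,\ldots,X_p)$ based on $N$ observations $(x_{11}, x_1^{(2)}),\ldots,(x_{1N}, x_N^{(2)})$.
The conditional distribution of $[R^2/(1 - R^2)][(N - p)/(p - 1)]$ given $x_\alpha^{(2)}$ fixed
is noncentral $F$ with $p - 1$ and $N - p$ degrees of freedom and noncentrality
parameter $\beta'A_{22}\beta/\sigma_{11\cdot 2}$.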


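The conditional density (from Theorem 5.4.1) of $F = [R^2/(1 - R^2)][(N - p)/(p - 1)]$ is


(37) $\dfrac{(p - 1)\exp[-\frac{1}{2}\beta'A_{22}\beta/\sigma_{11\cdot 2}]}{(N - p)\,\Gamma[\frac{1}{2}(N - p)]} \sum_{\alpha=0}^{\infty} \dfrac{\left( \dfrac{\beta'A_{22}\beta}{2\sigma_{11\cdot 2}} \right)^{\alpha} \left[ \dfrac{(p-1)f}{N-p} \right]^{\frac{1}{2}(p-1)+\alpha-1} \Gamma[\frac{1}{2}(N - 1) + \alpha]}{\alpha!\,\Gamma[\frac{1}{2}(p - 1) + \alpha] \left[ 1 + \dfrac{(p-1)f}{N-p} \right]^{\frac{1}{2}(N-1)+\alpha}},$


and the conditional density of $W = R^2$ is ($df = [(N - p)/(p - 1)](1 - w)^{-2}\,dw$)


(38) $\dfrac{\exp[-\frac{1}{2}\beta'A_{22}\beta/\sigma_{11\cdot 2}]}{\Gamma[\frac{1}{2}(N - p)]}\,(1 - w)^{\frac{1}{2}(N-p)-1} \sum_{\alpha=0}^{\infty} \dfrac{\left( \dfrac{\beta'A_{22}\beta}{2\sigma_{11\cdot 2}} \right)^{\alpha} w^{\frac{1}{2}(p-1)+\alpha-1}\,\Gamma[\frac{1}{2}(N - 1) + \alpha]}{\alpha!\,\Gamma[\frac{1}{2}(p - 1) + \alpha]}.$

To obtain the unconditional density we need to multiply (38) by the density
of $Z_1^{(2)},\ldots,Z_n^{(2)}$ to obtain the joint density of $W$ and $Z_1^{(2)},\ldots,Z_n^{(2)}$ and then
integrate with respect to the latter set to obtain the marginal density of $W$.
We have


(39) $\dfrac{\beta'A_{22}\beta}{\sigma_{11\cdot 2}} = \dfrac{\beta'\sum_{\alpha=1}^{n} z_\alpha^{(2)}z_\alpha^{(2)\prime}\beta}{\sigma_{11\cdot 2}} = \sum_{\alpha=1}^{n} \left( \dfrac{\beta'z_\alpha^{(2)}}{\sqrt{\sigma_{11\cdot 2}}} \right)^2.$


Since the distribution of $Z_\alpha^{(2)}$ is $N(0, \Sigma_{22})$, the distribution of $\beta'Z_\alpha^{(2)}/\sqrt{\sigma_{11\cdot 2}}$
is normal with mean zero and variance


(40) $\mathcal{E}\left( \dfrac{\beta'Z_\alpha^{(2)}}{\sqrt{\sigma_{11\cdot 2}}} \right)^2 = \dfrac{\mathcal{E}\beta'Z_\alpha^{(2)}Z_\alpha^{(2)\prime}\beta}{\sigma_{11\cdot 2}} = \dfrac{\beta'\Sigma_{22}\beta}{\sigma_{11} - \beta'\Sigma_{22}\beta} = \dfrac{\beta'\Sigma_{22}\beta/\sigma_{11}}{1 - \beta'\Sigma_{22}\beta/\sigma_{11}} = \dfrac{\bar{R}^2}{1 - \bar{R}^2}.$


Thus $(\beta'A_{22}\beta/\sigma_{11\cdot 2})/[\bar{R}^2/(1 - \bar{R}^2)]$ has a $\chi^2$-distribution with $n$ degrees of
freedom. Let $\bar{R}^2/(1 - \bar{R}^2) = \phi$. Then $\beta'A_{22}\beta/\sigma_{11\cdot 2} = \phi\chi_n^2$. We compute


(41) $\mathcal{E}e^{-\frac{1}{2}\phi\chi_n^2}\left( \dfrac{\phi\chi_n^2}{2} \right)^{\alpha} = \phi^{\alpha}\int_0^{\infty} \left( \dfrac{u}{2} \right)^{\alpha} e^{-\frac{1}{2}\phi u}\,\dfrac{1}{2^{\frac{1}{2}n}\Gamma(\frac{1}{2}n)}\,u^{\frac{1}{2}n-1}e^{-\frac{1}{2}u}\,du$
$\qquad = \dfrac{\phi^{\alpha}}{(1 + \phi)^{\frac{1}{2}n+\alpha}}\,\dfrac{\Gamma(\frac{1}{2}n + \alpha)}{\Gamma(\frac{1}{2}n)} \int_0^{\infty} \dfrac{1}{2^{\frac{1}{2}n+\alpha}\Gamma(\frac{1}{2}n + \alpha)}\,v^{\frac{1}{2}n+\alpha-1}e^{-\frac{1}{2}v}\,dv$
$\qquad = \dfrac{\phi^{\alpha}}{(1 + \phi)^{\frac{1}{2}n+\alpha}}\,\dfrac{\Gamma(\frac{1}{2}n + \alpha)}{\Gamma(\frac{1}{2}n)}.$


Applying this result to (38), we obtain as the density of $R^2$


(42) $\dfrac{(1 - R^2)^{\frac{1}{2}(n-p-1)}(1 - \bar{R}^2)^{\frac{1}{2}n}}{\Gamma[\frac{1}{2}(n - p + 1)]\,\Gamma(\frac{1}{2}n)} \sum_{\mu=0}^{\infty} \dfrac{(\bar{R}^2)^{\mu}(R^2)^{\frac{1}{2}(p-1)+\mu-1}\,\Gamma^2(\frac{1}{2}n + \mu)}{\mu!\,\Gamma[\frac{1}{2}(p - 1) + \mu]}.$

Fisher (1928) found this distribution. It can also be written

(43) $\dfrac{\Gamma(\frac{1}{2}n)(1 - \bar{R}^2)^{\frac{1}{2}n}}{\Gamma[\frac{1}{2}(n - p + 1)]\,\Gamma[\frac{1}{2}(p - 1)]}\,(R^2)^{\frac{1}{2}(p-3)}(1 - R^2)^{\frac{1}{2}(n-p-1)}\,F[\tfrac{1}{2}n, \tfrac{1}{2}n; \tfrac{1}{2}(p - 1); R^2\bar{R}^2],$


where $F$ is the hypergeometric function defined in (41) of Section 4.2.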


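Another form of the density can be obtained when $n - p + 1$ is even. We
have


(44) $\sum_{\mu=0}^{\infty} \dfrac{(R^2\bar{R}^2)^{\mu}\,\Gamma^2(\frac{1}{2}n + \mu)}{\mu!\,\Gamma[\frac{1}{2}(p - 1) + \mu]} = \sum_{\mu=0}^{\infty} \dfrac{(R^2\bar{R}^2)^{\mu}\,\Gamma(\frac{1}{2}n + \mu)}{\mu!} \left( \dfrac{\partial}{\partial t} \right)^{\frac{1}{2}(n-p+1)} t^{\frac{1}{2}n+\mu-1} \Big|_{t=1}$
$\qquad = \left( \dfrac{\partial}{\partial t} \right)^{\frac{1}{2}(n-p+1)} t^{\frac{1}{2}n-1} \sum_{\mu=0}^{\infty} \dfrac{(tR^2\bar{R}^2)^{\mu}\,\Gamma(\frac{1}{2}n + \mu)}{\mu!} \Big|_{t=1}$
$\qquad = \Gamma(\tfrac{1}{2}n) \left( \dfrac{\partial}{\partial t} \right)^{\frac{1}{2}(n-p+1)} t^{\frac{1}{2}n-1}(1 - tR^2\bar{R}^2)^{-\frac{1}{2}n} \Big|_{t=1}.$


The density is therefore


(45) $\dfrac{(1 - \bar{R}^2)^{\frac{1}{2}n}(R^2)^{\frac{1}{2}(p-3)}(1 - R^2)^{\frac{1}{2}(n-p-1)}}{\Gamma[\frac{1}{2}(n - p + 1)]} \left( \dfrac{\partial}{\partial t} \right)^{\frac{1}{2}(n-p+1)} t^{\frac{1}{2}n-1}(1 - tR^2\bar{R}^2)^{-\frac{1}{2}n} \Big|_{t=1}.$


Theorem 4.4.5. The density of the square of the multiple correlation
coefficient, $R^2$, between $X_1$ and $X_2,\ldots,X_p$ based on a sample of $N = n + 1$ is given
by (42) or (43) [or (45) in the case of $n - p + 1$ even], where $\bar{R}^2$ is the
corresponding population multiple correlation coefficient.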
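
The series (42) is straightforward to evaluate numerically, which gives a check that it integrates to 1 and that it reduces to the null density when the population multiple correlation is zero. The sketch below uses only the Python standard library; the function names, the truncation at 200 terms, the midpoint rule, and the values $n = 19$, $p = 4$ are assumptions made for illustration.

```python
import math

def r2_density(w, n, p, rbar2, terms=200):
    """Truncated series (42) for the density of W = R^2.

    n = N - 1; rbar2 is the squared population multiple correlation.
    """
    half_n, c, b = 0.5 * n, 0.5 * (p - 1), 0.5 * (n - p + 1)
    log_front = ((b - 1) * math.log1p(-w) + half_n * math.log1p(-rbar2)
                 - math.lgamma(b) - math.lgamma(half_n))
    total = 0.0
    for mu in range(terms):
        if rbar2 == 0.0 and mu > 0:
            break                       # only the mu = 0 term survives
        log_term = (2 * math.lgamma(half_n + mu) - math.lgamma(mu + 1)
                    - math.lgamma(c + mu) + (c + mu - 1) * math.log(w))
        if mu > 0:
            log_term += mu * math.log(rbar2)
        total += math.exp(log_front + log_term)
    return total

def integrate(f, steps=1000):
    """Midpoint rule on the open interval (0, 1)."""
    h = 1.0 / steps
    return h * sum(f((i + 0.5) * h) for i in range(steps))

n, p = 19, 4
area = integrate(lambda w: r2_density(w, n, p, rbar2=0.5))
```

Working on the log scale with `math.lgamma` avoids overflow in the gamma functions; the truncation point would need to grow as the population value approaches 1.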


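The moments of $R$ are


(46) $\mathcal{E}R^h = \dfrac{(1 - \bar{R}^2)^{\frac{1}{2}n}}{\Gamma[\frac{1}{2}(n - p + 1)]\,\Gamma(\frac{1}{2}n)} \sum_{\mu=0}^{\infty} \dfrac{(\bar{R}^2)^{\mu}\,\Gamma^2(\frac{1}{2}n + \mu)}{\mu!\,\Gamma[\frac{1}{2}(p - 1) + \mu]} \int_0^1 (R^2)^{\frac{1}{2}(p+h-1)+\mu-1}(1 - R^2)^{\frac{1}{2}(n-p+1)-1}\,d(R^2)$
$\qquad = \dfrac{(1 - \bar{R}^2)^{\frac{1}{2}n}}{\Gamma(\frac{1}{2}n)} \sum_{\mu=0}^{\infty} \dfrac{(\bar{R}^2)^{\mu}\,\Gamma^2(\frac{1}{2}n + \mu)\,\Gamma[\frac{1}{2}(p + h - 1) + \mu]}{\mu!\,\Gamma[\frac{1}{2}(p - 1) + \mu]\,\Gamma[\frac{1}{2}(n + h) + \mu]}.$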


The sample multiple correlation tends to overestimate the population
multiple correlation. The sample multiple correlation is the maximum sample
correlation between $x_1$ and linear combinations of $x^{(2)}$ and hence is greater
than the sample correlation between $x_1$ and $\beta'x^{(2)}$; however, the latter is the
simple sample correlation corresponding to the simple population correlation
between $x_1$ and $\beta'x^{(2)}$, which is $\bar{R}$, the population multiple correlation.

Suppose $R_1$ is the multiple correlation in the first of two samples and $\hat\beta_1$
is the estimate of $\beta$; then the simple correlation between $x_1$ and $\hat\beta_1'x^{(2)}$ in
the second sample will tend to be less than $R_1$ and in particular will be less
than $R_2$, the multiple correlation in the second sample. This has been called
"the shrinkage of the multiple correlation."

Kramer (1963) and Lee (1972) have given tables of the upper significance
points of $R$. Gajjar (1967), Gurland (1968), Gurland and Milton (1970),
Khatri (1966), and Lee (1971b) have suggested approximations to the
distributions of $R^2/(1 - R^2)$ and obtained large-sample results.


4.4.4, Some Optimal Properties of the Multiple Correlation Test 


Theorem 4.4.6. Given the observations x,,...,Xy from N(p, 2), of all tests 
of R=0 at a given significance level based on ¥ and A= UN_ (x, —¥Xx,—X)' 
that are invariant with respect to transformations 


xe=ck, +d, xP* = CH +4, 
wo i =e! *,=0Ca A’, = CAy,C’ 
a =C°a),, Gy) = CCE), 2 nl, 


any critical rejection region given by R greater than a constant is unifornily most 
powerful. 


Proof. The multiple correlation coefficient $R$ is invariant under the
transformation, and any function of the sufficient statistics that is invariant is a
function of $R$. (See Problem 4.34.) Therefore, any invariant test must be
based on $R$. The Neyman-Pearson fundamental lemma applied to testing
the null hypothesis $\bar{R} = 0$ against a specific alternative $\bar{R} = \bar{R}_1 > 0$ tells us the
most powerful test at a given level of significance is based on the ratio of the
density of $R$ for $\bar{R} = \bar{R}_1$, which is (42) times $2R$ [because (42) is the density of
$R^2$], to the density for $\bar{R} = 0$, which is (22). The ratio is a positive constant
times

(48)  $\sum_{\mu=0}^{\infty} \dfrac{(\bar{R}_1^2)^{\mu}\, \Gamma^2(\tfrac12 n + \mu)}{\mu!\, \Gamma[\tfrac12 (p-1) + \mu]}\, R^{2\mu}.$

Since (48) is an increasing function of $R$ for $R > 0$, the set of $R$ for which
(48) is greater than a constant is an interval of $R$ greater than a constant. ∎




Theorem 4.4.7. On the basis of observations $x_1, \ldots, x_N$ from $N(\mu, \Sigma)$, of
all tests of $\bar{R} = 0$ at a given significance level with power depending only on $\bar{R}$, the
test with critical region given by $R$ greater than a constant is uniformly most
powerful.


Theorem 4.4.7 follows from Theorem 4.4.6 in the same way that Theorem 
5.6.4 follows from Theorem 5.6.1. 


4.5. ELLIPTICALLY CONTOURED DISTRIBUTIONS 


4.5.1. Observations Elliptically Contoured 


Suppose $x_1, \ldots, x_N$ are $N$ independent observations on a random $p$-vector $X$
with density

(1)  $|\Lambda|^{-1/2} g[(x - \nu)'\Lambda^{-1}(x - \nu)].$

The sample covariance matrix $S$ is an unbiased estimator of the covariance
matrix $\Sigma = [\mathcal{E}R^2/p]\Lambda$, where $R^2 = (X - \nu)'\Lambda^{-1}(X - \nu)$ and $\mathcal{E}R^2 < \infty$. An
estimator of $\rho_{ij} = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}} = \lambda_{ij}/\sqrt{\lambda_{ii}\lambda_{jj}}$ is $r_{ij} = s_{ij}/\sqrt{s_{ii}s_{jj}}$, $i, j = 1, \ldots, p$.
The small-sample distribution of $r_{ij}$ is in general difficult to obtain,
but the asymptotic distribution can be obtained from the limiting normal
distribution of $\sqrt{N}(S - \Sigma)$ given in (13) of Section 3.6.

First we prove a general theorem on asymptotic distributions of functions
of the sample covariance matrix $S$ using Theorems 4.2.3 and 3.6.5. Define

(2)  $s = \operatorname{vec} S, \qquad \sigma = \operatorname{vec} \Sigma.$


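Theorem 4.5.1. Let $f(s)$ be a vector-valued function such that each
component of $f(s)$ has a nonzero differential at $s = \sigma$. Suppose $S$ is the covariance of a
sample from (1) such that $\mathcal{E}R^4 < \infty$. Then

(3)  $\sqrt{N}[f(s) - f(\sigma)] = \dfrac{\partial f(\sigma)}{\partial \sigma'} \sqrt{N}(s - \sigma) + o_p(1)$
     $\xrightarrow{d} N\left\{0,\ \dfrac{\partial f(\sigma)}{\partial \sigma'} \left[(1 + \kappa)(I_{p^2} + K_{pp})(\Sigma \otimes \Sigma) + \kappa\sigma\sigma'\right] \left(\dfrac{\partial f(\sigma)}{\partial \sigma'}\right)'\right\}.$

Corollary 4.5.1. If

(4)  $f(cs) = f(s)$

for all $c > 0$ and all positive definite $S$ and the conditions of Theorem 4.5.1 hold,
then

(5)  $\sqrt{N}[f(s) - f(\sigma)] \xrightarrow{d} N\left[0,\ 2(1 + \kappa) \dfrac{\partial f(\sigma)}{\partial \sigma'} (\Sigma \otimes \Sigma) \left(\dfrac{\partial f(\sigma)}{\partial \sigma'}\right)'\right].$

Proof. From (4) we deduce

(6)  $0 = \dfrac{\partial f(cs)}{\partial c} = \dfrac{\partial f(cs)}{\partial (cs)'} \dfrac{\partial (cs)}{\partial c} = \dfrac{\partial f(cs)}{\partial (cs)'}\, s.$

That is,

(7)  $\dfrac{\partial f(s)}{\partial s'}\, s = 0.$ ∎

The conclusion of Corollary 4.5.1 can be framed as

(8)  $\dfrac{\sqrt{N}}{\sqrt{1 + \kappa}} [f(s) - f(\sigma)] \xrightarrow{d} N\left[0,\ 2 \dfrac{\partial f(\sigma)}{\partial \sigma'} (\Sigma \otimes \Sigma) \left(\dfrac{\partial f(\sigma)}{\partial \sigma'}\right)'\right].$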


The limiting normal distribution in (8) holds in particular when the sample is
drawn from the normal distribution. The corollary holds true if $\kappa$ is replaced
by a consistent estimator $\hat{\kappa}$. For example, a consistent estimator of $1 + \kappa$
given by (16) of Section 3.6 is

(9)  $1 + \hat{\kappa} = \sum_{\alpha=1}^{N} \left[(x_\alpha - \bar{x})'S^{-1}(x_\alpha - \bar{x})\right]^2 / [Np(p + 2)].$
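The estimator of $1 + \kappa$, namely the sum of squared Mahalanobis-type distances divided by $Np(p+2)$, is straightforward to compute. The following is a minimal sketch assuming NumPy; the function name `one_plus_kappa_hat` is hypothetical, and $S$ is taken to be the unbiased sample covariance matrix with divisor $N - 1$.

```python
import numpy as np

def one_plus_kappa_hat(x):
    """Estimate 1 + kappa from an (N x p) data matrix x.

    Computes sum_a [(x_a - xbar)' S^{-1} (x_a - xbar)]^2 / [N p (p + 2)],
    with S the unbiased sample covariance (divisor N - 1, an assumption here).
    """
    x = np.asarray(x, dtype=float)
    n_obs, p = x.shape
    centered = x - x.mean(axis=0)
    s = centered.T @ centered / (n_obs - 1)
    # Quadratic form (x_a - xbar)' S^{-1} (x_a - xbar) for each observation.
    d2 = np.einsum("ij,jk,ik->i", centered, np.linalg.inv(s), centered)
    return float(np.sum(d2 ** 2) / (n_obs * p * (p + 2)))
```

For a large sample from a normal distribution the returned value should be close to 1, that is, $\hat{\kappa}$ close to 0; heavier-tailed elliptically contoured distributions give larger values.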


A sample correlation such as $f(s) = r_{ij} = s_{ij}/\sqrt{s_{ii}s_{jj}}$ or a set of such
correlations is a function of $S$ that is invariant under scale transformations;
that is, it satisfies (4).


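Corollary 4.5.2. Under the conditions of Theorem 4.5.1,

(10)  $\sqrt{\dfrac{N}{1 + \kappa}}\; \dfrac{r_{ij} - \rho_{ij}}{1 - \rho_{ij}^2} \xrightarrow{d} N(0, 1).$

As in the case of the observations normally distributed,

(11)  $\sqrt{\dfrac{N}{1 + \kappa}} \left[\tfrac12 \log \dfrac{1 + r_{ij}}{1 - r_{ij}} - \tfrac12 \log \dfrac{1 + \rho_{ij}}{1 - \rho_{ij}}\right] \xrightarrow{d} N(0, 1).$

Of course, any improvement of (11) over (10) depends on the distribution
sampled.

Partial correlations such as $r_{ij \cdot q+1, \ldots, p}$, $i, j = 1, \ldots, q$, are also invariant
functions of $S$.

Corollary 4.5.3. Under the conditions of Theorem 4.5.1,

(12)  $\sqrt{\dfrac{N}{1 + \kappa}}\; \dfrac{r_{ij \cdot q+1, \ldots, p} - \rho_{ij \cdot q+1, \ldots, p}}{1 - \rho_{ij \cdot q+1, \ldots, p}^2} \xrightarrow{d} N(0, 1).$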
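The Fisher $z$ limit for elliptically contoured samples suggests an approximate confidence interval for a correlation: $\tfrac12\log[(1+r)/(1-r)]$ is treated as normal with standard deviation $\sqrt{(1+\kappa)/N}$. The following is a minimal sketch using only the Python standard library; the function name and its defaults are hypothetical, and $\kappa$ would in practice be replaced by a consistent estimate. It uses $N$ rather than the small-sample divisor $N - 3$ often used in the normal case.

```python
import math
from statistics import NormalDist

def fisher_z_interval(r, n_obs, kappa=0.0, level=0.95):
    """Approximate confidence interval for rho from a sample correlation r.

    Uses sqrt(N / (1 + kappa)) * (z - zeta) ~ N(0, 1) asymptotically,
    where z = atanh(r) and zeta = atanh(rho).
    """
    z = math.atanh(r)  # equals 0.5 * log((1 + r) / (1 - r))
    quantile = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = quantile * math.sqrt((1.0 + kappa) / n_obs)
    # Map the interval for zeta back to the correlation scale.
    return math.tanh(z - half_width), math.tanh(z + half_width)
```

With `kappa=0.0` this is the normal-theory interval; a positive `kappa` widens it, reflecting the extra variability of $r$ under heavier-tailed elliptical distributions.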




Now let us consider the asymptotic distribution of $R^2$, the square of the
multiple correlation, when $\bar{R}^2$, the square of the population multiple
correlation, is 0. We use the notation of Section 4.4. $\bar{R}^2 = 0$ is equivalent to $\sigma_{(1)} = 0$.
Since the sample and population multiple correlation coefficients between
$X_1$ and $X^{(2)} = (X_2, \ldots, X_p)'$ are invariant with respect to linear
transformations (47) of Section 4.4, for purposes of studying the distribution
of $R^2$ we can assume $\mu = 0$ and $\Sigma = I_p$. In that case $s_{11} \xrightarrow{p} 1$, $s_{(1)} \xrightarrow{p} 0$, and
$S_{22} \xrightarrow{p} I_{p-1}$. Furthermore, for $k, i \ne 1$ and $j = l = 1$, Lemma 3.6.1 gives

(13)  $\mathcal{E} s_{(1)} s_{(1)}' = \left(\dfrac{1}{n} + \dfrac{\kappa}{N}\right) I_{p-1}.$

Theorem 4.5.2. Under the conditions of Theorem 4.5.1

(14)  $\sqrt{\dfrac{N}{1 + \kappa}}\; s_{(1)} \xrightarrow{d} N(0, I_{p-1}).$

Corollary 4.5.4. Under the conditions of Theorem 4.5.1

(15)  $\dfrac{NR^2}{1 + \kappa} = \dfrac{N s_{(1)}' S_{22}^{-1} s_{(1)}}{(1 + \kappa) s_{11}} \xrightarrow{d} \chi^2_{p-1}.$


4.5.2. Elliptically Contoured Matrix Distributions 


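Now let us turn to the model

(16)  $|\Lambda|^{-N/2} g[\operatorname{tr}(X - \varepsilon_N \nu')\Lambda^{-1}(X - \varepsilon_N \nu')']$

based on the vector spherical model $g(\operatorname{tr} Y'Y)$. The unbiased estimators
of $\nu$ and $\Sigma = (\mathcal{E}R^2/p)\Lambda$ are $\bar{x} = (1/N)X'\varepsilon_N$ and $S = (1/n)A$, where $A =
(X - \varepsilon_N \bar{x}')'(X - \varepsilon_N \bar{x}')$.

Since

(17)  $(X - \varepsilon_N \nu')'(X - \varepsilon_N \nu') = A + N(\bar{x} - \nu)(\bar{x} - \nu)',$

$A$ and $\bar{x}$ are a complete set of sufficient statistics.

The maximum likelihood estimators of $\nu$ and $\Lambda$ are $\hat{\nu} = \bar{x}$ and $\hat{\Lambda} =
(p/w_g)A$. The maximum likelihood estimator of $\rho_{ij} = \lambda_{ij}/\sqrt{\lambda_{ii}\lambda_{jj}} =
\sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}}$ is $\hat{\rho}_{ij} = a_{ij}/\sqrt{a_{ii}a_{jj}} = s_{ij}/\sqrt{s_{ii}s_{jj}}$ (Theorem 3.6.4).

The sample correlation $r_{ij}$ is a function $f(X)$ that satisfies the conditions
(45) and (46) of Theorem 3.6.5 and hence has the same distribution for an
arbitrary density $g[\operatorname{tr}(\cdot)]$ as for the normal density $g[\operatorname{tr}(\cdot)] = \text{const}\, e^{-\frac12 \operatorname{tr}(\cdot)}$.
Similarly, a partial correlation $r_{ij \cdot q+1, \ldots, p}$ and a multiple correlation $R^2$
satisfy the conditions, and the conclusion holds.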


Theorem 4.5.3. When $X$ has the vector elliptical density (16), the
distributions of $r_{ij}$, $r_{ij \cdot q+1, \ldots, p}$, and $R^2$ are the distributions derived for normally distributed
observations.

It follows from Theorem 4.5.3 that the asymptotic distributions of $r_{ij}$,
$r_{ij \cdot q+1, \ldots, p}$, and $R^2$ are the same as for sampling from normal distributions.

The class of left spherical matrices $Y$ with densities is the class of $g(Y'Y)$.
Let $X = YC' + \varepsilon_N \nu'$, where $C'\Lambda^{-1}C = I$, that is, $\Lambda = CC'$. Then $X$ has the
density

(18)  $|C|^{-N} g[C^{-1}(X - \varepsilon_N \nu')'(X - \varepsilon_N \nu')(C')^{-1}].$

We now find a stochastic representation of the matrix $Y$.


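Lemma 4.5.1. Let $V = (v_1, \ldots, v_p)$, where $v_i$ is an $N$-component vector,
$i = 1, \ldots, p$. Define recursively $w_1 = v_1$,

(19)  $w_i = v_i - \sum_{j=1}^{i-1} \dfrac{v_i' w_j}{w_j' w_j}\, w_j, \qquad i = 2, \ldots, p.$

Let $u_i = w_i/\|w_i\|$. Then $\|u_i\| = 1$, $i = 1, \ldots, p$, and $u_i' u_j = 0$, $i \ne j$. Further,

(20)  $V = UT',$

where $U = (u_1, \ldots, u_p)$; $t_{ii} = \|w_i\|$, $i = 1, \ldots, p$; $t_{ij} = v_i' w_j/\|w_j\| = v_i' u_j$,
$j = 1, \ldots, i - 1$, $i = 2, \ldots, p$; and $t_{ij} = 0$, $i < j$.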


The proof of the lemma is given in the first part of Section 7.2 and as the 
Gram-Schmidt orthogonalization in the Appendix (Section A.3.1). This 
lemma generalizes the construction in Section 3.2; see Figure 3.1. See also 
Figure 7.1. 

Note that $T$ is lower triangular, $U'U = I_p$, and $V'V = TT'$. The last
equation, $t_{ii} \ge 0$, $i = 1, \ldots, p$, and $t_{ij} = 0$, $i < j$, can be solved uniquely for $T$.
Thus $T$ is a function of $V'V$ (and the restrictions).
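The recursive construction of the lemma (Gram-Schmidt orthogonalization) can be sketched in code. This is a minimal illustration assuming NumPy and a matrix $V$ of full column rank; the function name `gram_schmidt_ut` is hypothetical. It returns $U$ with orthonormal columns and lower triangular $T$ with $V = UT'$, using $t_{ij} = v_i'u_j$ for $j < i$ and $t_{ii} = \|w_i\|$.

```python
import numpy as np

def gram_schmidt_ut(v):
    """Decompose V (N x p, full column rank) as V = U T' with U'U = I, T lower triangular."""
    v = np.asarray(v, dtype=float)
    n_rows, p = v.shape
    u = np.zeros((n_rows, p))
    t = np.zeros((p, p))
    for i in range(p):
        w = v[:, i].copy()
        for j in range(i):
            t[i, j] = v[:, i] @ u[:, j]   # t_ij = v_i' u_j for j < i
            w -= t[i, j] * u[:, j]        # remove the component along u_j
        t[i, i] = np.linalg.norm(w)       # t_ii = ||w_i||
        u[:, i] = w / t[i, i]
    return u, t
```

Because $V'V = TT'$ with $T$ lower triangular and positive diagonal, the matrix `t` coincides with the Cholesky factor of $V'V$, which gives a convenient numerical check of the uniqueness statement.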

Let $Y$ ($N \times p$) have the density $g(Y'Y)$, and let $O_N$ be an orthogonal
$N \times N$ matrix. Then $Y^* = O_N Y$ has the density $g(Y^{*\prime}Y^*)$. Hence $Y^* =
O_N Y \stackrel{d}{=} Y$. Let $Y^* = U^*T^{*\prime}$, where $t_{ii}^* > 0$, $i = 1, \ldots, p$, and $t_{ij}^* = 0$, $i < j$. From
$Y^{*\prime}Y^* = Y'Y$ it follows that $T^*T^{*\prime} = TT'$ and hence $T^* = T$, $Y^* = U^*T'$, and
$U^* = O_N U \stackrel{d}{=} U$. Let the space of $U$ ($N \times p$) such that $U'U = I_p$ be denoted
$O(N \times p)$.

Definition 4.5.1. If $U$ ($N \times p$) satisfies $U'U = I_p$ and $O_N U \stackrel{d}{=} U$ for all
orthogonal $O_N$, then $U$ is uniformly distributed on $O(N \times p)$.


The space of $U$ satisfying $U'U = I_p$ is known as a Stiefel manifold. The
probability measure of Definition 4.5.1 is known as the Haar invariant
distribution. The property $O_N U \stackrel{d}{=} U$ for all orthogonal $O_N$ defines the
(normalized) measure uniquely [Halmos (1956)].

Theorem 4.5.4. If $Y$ ($N \times p$) has the density $g(Y'Y)$, then $U$ defined by
$Y = UT'$, $U'U = I_p$, $t_{ii} > 0$, $i = 1, \ldots, p$, and $t_{ij} = 0$, $i < j$, is uniformly
distributed on $O(N \times p)$.


The proof of Corollary 7.2.1 shows that for arbitrary $g(\cdot)$ the density of
$T$ is

(21)  $\prod_{i=1}^{p} \left\{ C[\tfrac12 (N + 1 - i)]\, t_{ii}^{N-i} \right\} g(\operatorname{tr} TT'),$

where $C(\cdot)$ is defined in (8) of Section 2.7.

The stochastic representation of $Y$ ($N \times p$) with density $g(Y'Y)$ is

(22)  $Y = UT',$

where $U$ ($N \times p$) is uniformly distributed on $O(N \times p)$ and $T$ is lower
triangular with positive diagonal elements and has density (21).


Theorem 4.5.5. Let $f(X)$ be a vector-valued function of $X$ ($N \times p$) such
that

(23)  $f(X + \varepsilon_N \nu') = f(X)$

for all $\nu$ and

(24)  $f(XG') = f(X)$

for all $G$ ($p \times p$). Then the distribution of $f(X)$ where $X$ has an arbitrary density
(18) is the same as the distribution of $f(X)$ where $X$ has the normal density (18).

Proof. From (23) we find that $f(X) = f(YC')$, and from (24) we find
$f(YC') = f(UT'C') = f(U)$, which is the same for arbitrary and normal
densities (18). ∎

Corollary 4.5.5. Let $f(X)$ be a vector-valued function of $X$ ($N \times p$) with
the density (18), where $\nu = 0$. Suppose (24) holds for all $G$ ($p \times p$). Then the
distribution of $f(X)$ for an arbitrary density (18) is the same as the distribution of
$f(X)$ when $X$ has the normal density (18).


The condition (24) of Corollary 4.5.5 is that $f(X)$ is invariant with respect
to linear transformations $X \to XG$.

The density (18) can be written as

(25)  $|C|^{-N} g\{C^{-1}[A + N(\bar{x} - \nu)(\bar{x} - \nu)'](C')^{-1}\},$

which shows that $A$ and $\bar{x}$ are a complete set of sufficient statistics for
$\Lambda = CC'$ and $\nu$.


PROBLEMS 


4.1. (Sec. 4.2.1) Sketch

$k_N(r) = \dfrac{\Gamma[\tfrac12 (N-1)]}{\Gamma(\tfrac12 N - 1)\sqrt{\pi}}\, (1 - r^2)^{\frac12 (N-4)}$

for (a) $N = 3$, (b) $N = 4$, (c) $N = 5$, and (d) $N = 10$.

4.2. (Sec. 4.2.1) Using the data of Problem 3.1, test the hypothesis that $X_1$ and $X_2$
are independent against all alternatives of dependence at significance level 0.01.

4.3. (Sec. 4.2.1) Suppose a sample correlation of 0.65 is observed in a sample of 10.
Test the hypothesis of independence against the alternatives of positive
correlation at significance level 0.05.

4.4. (Sec. 4.2.2) Suppose a sample correlation of 0.65 is observed in a sample of 20.
Test the hypothesis that the population correlation is 0.4 against the alternatives
that the population correlation is greater than 0.4 at significance level 0.05.

4.5. (Sec. 4.2.1) Find the significance points for testing $\rho = 0$ at the 0.01 level with
$N = 15$ observations against alternatives (a) $\rho \ne 0$, (b) $\rho > 0$, and (c) $\rho < 0$.

4.6. (Sec. 4.2.2) Find significance points for testing $\rho = 0.6$ at the 0.01 level with
$N = 20$ observations against alternatives (a) $\rho \ne 0.6$, (b) $\rho > 0.6$, and (c) $\rho < 0.6$.

4.7. (Sec. 4.2.2) Tabulate the power function at $\rho = -1(0.2)1$ for the tests in
Problem 4.5. Sketch the graph of each power function.

4.8. (Sec. 4.2.2) Tabulate the power function at $\rho = -1(0.2)1$ for the tests in
Problem 4.6. Sketch the graph of each power function.

4.9. (Sec. 4.2.2) Using the data of Problem 3.1, find a (two-sided) confidence
interval for $\rho_{12}$ with confidence coefficient 0.99.

4.10. (Sec. 4.2.2) Suppose $N = 10$, $r = 0.795$. Find a one-sided confidence interval
for $\rho$ [of the form $(r_0, 1)$] with confidence coefficient 0.95.


4.11. (Sec. 4.2.3) Use Fisher's $z$ to test the hypothesis $\rho = 0.7$ against alternatives
$\rho \ne 0.7$ at the 0.05 level with $r = 0.5$ and $N = 50$.

4.12. (Sec. 4.2.3) Use Fisher's $z$ to test the hypothesis $\rho_1 = \rho_2$ against the
alternatives $\rho_1 \ne \rho_2$ at the 0.01 level with $r_1 = 0.5$, $N_1 = 40$, $r_2 = 0.6$, $N_2 = 40$.

4.13. (Sec. 4.2.3) Use Fisher's $z$ to estimate $\rho$ based on sample correlations of $-0.7$
($N = 30$) and of $-0.6$ ($N = 40$).

4.14. (Sec. 4.2.3) Use Fisher's $z$ to obtain a confidence interval for $\rho$ with
confidence 0.95 based on a sample correlation of 0.65 and a sample size of 25.

4.15. (Sec. 4.2.2) Prove that when $N = 2$ and $\rho = 0$, $\Pr\{r = 1\} = \Pr\{r = -1\} = \tfrac12$.

4.16. (Sec. 4.2) Let $k_N(r, \rho)$ be the density of the sample correlation coefficient $r$
for a given value of $\rho$ and $N$. Prove that $r$ has a monotone likelihood ratio; that
is, show that if $\rho_1 > \rho_2$, then $k_N(r, \rho_1)/k_N(r, \rho_2)$ is monotonically increasing in
$r$. [Hint: Using (40), prove that if

$F[\tfrac12, \tfrac12; n + \tfrac12; \tfrac12 (1 + \rho r)] = \sum_{\alpha=0}^{\infty} c_\alpha (1 + \rho r)^{\alpha} = g(r, \rho)$

has a monotone ratio, then $k_N(r, \rho)$ does. Show

$\dfrac{\partial^2}{\partial \rho\, \partial r} \log g(r, \rho) = \dfrac{\sum_{\alpha, \beta = 0}^{\infty} c_\alpha c_\beta \left[(\alpha - \beta)^2 \rho r + (\alpha + \beta)\right](1 + \rho r)^{\alpha + \beta - 2}}{2\left[\sum_{\alpha=0}^{\infty} c_\alpha (1 + \rho r)^{\alpha}\right]^2};$

if $(\partial^2/\partial \rho\, \partial r) \log g(r, \rho) > 0$, then $g(r, \rho)$ has a monotone ratio. Show the
numerator of the above expression is positive by showing that for each $\alpha$ the
sum on $\beta$ is positive; use the fact that $c_{\alpha+1} < \tfrac12 c_\alpha$.]

4.17. (Sec. 4.2) Show that of all tests of $\rho_0$ against a specific $\rho_1$ ($> \rho_0$) based on $r$,
the procedures for which $r > c$ implies rejection are the best. [Hint: This follows
from Problem 4.16.]

4.18. (Sec. 4.2) Show that of all tests of $\rho = \rho_0$ against $\rho > \rho_0$ based on $r$, a
procedure for which $r > c$ implies rejection is uniformly most powerful.

4.19. (Sec. 4.2) Prove $r$ has a monotone likelihood ratio for $r > 0$, $\rho > 0$ by proving
$h(r) = k_N(r, \rho_1)/k_N(r, \rho_2)$ is monotonically increasing for $\rho_1 > \rho_2$. Here $h(r)$ is
a constant times $\left(\sum_{\alpha=0}^{\infty} c_\alpha \rho_1^{\alpha} r^{\alpha}\right) / \left(\sum_{\alpha=0}^{\infty} c_\alpha \rho_2^{\alpha} r^{\alpha}\right)$. In the numerator of $h'(r)$,
show that the coefficient of $r^{\beta}$ is positive.

4.20. (Sec. 4.2) Prove that if $\Sigma$ is diagonal, then the sets $r_{ij}$ and $a_{ii}$ are
independently distributed. [Hint: Use the facts that $r_{ij}$ is invariant under scale
transformations and that the density of the observations depends only on the $a_{ii}$.]


4.21. (Sec. 4.2.1) Prove that if $\rho = 0$

$\mathcal{E} r^{2m} = \dfrac{\Gamma[\tfrac12 (N-1)]\, \Gamma(m + \tfrac12)}{\sqrt{\pi}\, \Gamma[\tfrac12 (N-1) + m]}.$

4.22. (Sec. 4.2.2) Prove $f_1(\rho)$ and $f_2(\rho)$ are monotonically increasing functions
of $\rho$.

4.23. (Sec. 4.2.2) Prove that the density of the sample correlation $r$ [given by
(38)] is

$\dfrac{n-1}{\pi} (1 - \rho^2)^{\frac12 n} (1 - r^2)^{\frac12 (n-3)} \int_0^1 \dfrac{x^{n-1}\, dx}{(1 - \rho r x)^{n} \sqrt{1 - x^2}}.$

[Hint: Expand $(1 - \rho r x)^{-n}$ in a power series, integrate, and use the duplication
formula for the gamma function.]

4.24. (Sec. 4.2) Prove that (39) is the density of $r$. [Hint: From Problem 2.12 show

$\int_0^{\infty}\!\!\int_0^{\infty} e^{-\frac12 (y^2 - 2xyz + z^2)}\, dy\, dz = \dfrac{\cos^{-1}(-x)}{\sqrt{1 - x^2}}.$

Then argue

$\int_0^{\infty}\!\!\int_0^{\infty} (yz)^{n-1} e^{-\frac12 (y^2 - 2xyz + z^2)}\, dy\, dz = \dfrac{d^{n-1}}{dx^{n-1}} \dfrac{\cos^{-1}(-x)}{\sqrt{1 - x^2}}.$

Finally show that the integral of (31) with respect to $a_{11}$ ($= y^2$) and $a_{22}$ ($= z^2$) is
(39).]

4.25. (Sec. 4.2) Prove that (40) is the density of $r$. [Hint: In (31) let $a_{11} = ue^{-v}$ and
$a_{22} = ue^{v}$; show that the density of $v$ ($0 < v < \infty$) and $r$ ($-1 < r < 1$) is


- 1 $n —n+4 Sa- eit no “3 
ye (t- P*) (L= pry "8 =r? = vy [IL - $1 + pro] 


Use the expansion 


Show that the integral is (40).] 


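4.26. (Sec. 4.2) Prove for integer $h$

$\mathcal{E} r^{2h+1} = \dfrac{(1 - \rho^2)^{\frac12 n}}{\sqrt{\pi}\, \Gamma(\tfrac12 n)} \sum_{\beta=0}^{\infty} \dfrac{(2\rho)^{2\beta+1}}{(2\beta + 1)!}\, \dfrac{\Gamma^2[\tfrac12 (n+1) + \beta]\, \Gamma(h + \beta + \tfrac32)}{\Gamma(\tfrac12 n + h + \beta + 1)},$

$\mathcal{E} r^{2h} = \dfrac{(1 - \rho^2)^{\frac12 n}}{\sqrt{\pi}\, \Gamma(\tfrac12 n)} \sum_{\beta=0}^{\infty} \dfrac{(2\rho)^{2\beta}}{(2\beta)!}\, \dfrac{\Gamma^2(\tfrac12 n + \beta)\, \Gamma(h + \beta + \tfrac12)}{\Gamma(\tfrac12 n + h + \beta)}.$

4.27. (Sec. 4.2) The $t$-distribution. Prove that if $X$ and $Y$ are independently
distributed, $X$ having the distribution $N(0, 1)$ and $Y$ having the $\chi^2$-distribution
with $m$ degrees of freedom, then $W = X/\sqrt{Y/m}$ has the density

$\dfrac{\Gamma[\tfrac12 (m+1)]}{\sqrt{m}\sqrt{\pi}\, \Gamma(\tfrac12 m)} \left(1 + \dfrac{t^2}{m}\right)^{-\frac12 (m+1)}.$

[Hint: In the joint density of $X$ and $Y$, let $x = t w^{\frac12} m^{-\frac12}$ and integrate out $w$.]

4.28. (Sec. 4.2) Prove

$\mathcal{E} r = \dfrac{(1 - \rho^2)^{\frac12 n}}{\Gamma(\tfrac12 n)} \sum_{\beta=0}^{\infty} \dfrac{\rho^{2\beta+1}\, \Gamma^2[\tfrac12 (n+1) + \beta]}{\beta!\, \Gamma[\tfrac12 n + \beta + 1]}.$

[Hint: Use Problem 4.26 and the duplication formula for the gamma function.]

4.29. (Sec. 4.2) Show that $\sqrt{n}(r_{ij} - \rho_{ij})$, $(i, j) = (1, 2), (1, 3), (2, 3)$, have a joint
limiting distribution with variances $(1 - \rho_{ij}^2)^2$ and covariances of $r_{ij}$ and $r_{ik}$, $j \ne k$,
being $\tfrac12 (2\rho_{jk} - \rho_{ij}\rho_{ik})(1 - \rho_{ij}^2 - \rho_{ik}^2 - \rho_{jk}^2) + \rho_{jk}^3$.

4.30. (Sec. 4.3.2) Find a confidence interval for $\rho_{13 \cdot 2}$ with confidence 0.95 based on
$r_{13 \cdot 2} = 0.097$ and $N = 20$.

4.31. (Sec. 4.3.2) Use Fisher's $z$ to test the hypothesis $\rho_{12 \cdot 34} = 0$ against alternatives
$\rho_{12 \cdot 34} \ne 0$ at significance level 0.01 with $r_{12 \cdot 34} = 0.14$ and $N = 40$.

4.32. (Sec. 4.3) Show that the inequality $r_{12 \cdot 3}^2 \le 1$ is the same as the inequality
$|r_{ij}| \ge 0$, where $|r_{ij}|$ denotes the determinant of the $3 \times 3$ correlation matrix.

4.33. (Sec. 4.3) Invariance of the sample partial correlation coefficient. Prove that
$r_{12 \cdot 3, \ldots, p}$ is invariant under the transformations $x_{i\alpha}^* = a_i x_{i\alpha} + b_i' x_\alpha^{(3)} + c_i$, $a_i > 0$,
$i = 1, 2$, $x_\alpha^{(3)*} = Cx_\alpha^{(3)} + b$, $\alpha = 1, \ldots, N$, where $x_\alpha^{(3)} = (x_{3\alpha}, \ldots, x_{p\alpha})'$, and that
any function of $\bar{x}$ and $\hat{\Sigma}$ that is invariant under these transformations is a
function of $r_{12 \cdot 3, \ldots, p}$.

4.34. (Sec. 4.4) Invariance of the sample multiple correlation coefficient. Prove that $R$
is a function of the sufficient statistics $\bar{x}$ and $S$ that is invariant under changes
of location and scale of $x_{1\alpha}$ and nonsingular linear transformations of $x_\alpha^{(2)}$ (that
is, $x_{1\alpha}^* = cx_{1\alpha} + d$, $x_\alpha^{(2)*} = Cx_\alpha^{(2)} + \mathbf{d}$, $\alpha = 1, \ldots, N$) and that every function of $\bar{x}$
and $S$ that is invariant is a function of $R$.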


4.35. (Sec. 4.4) Prove that conditional on $Z_{1\alpha} = z_{1\alpha}$, $\alpha = 1, \ldots, n$, $R^2/(1 - R^2)$ is
distributed like $T^2/(N^* - 1)$, where $T^2 = N^* \bar{x}'S^{-1}\bar{x}$ based on $N^* = n$
observations on a vector $X$ with $p^* = p - 1$ components, with mean vector $(c/\sigma_{11})\sigma_{(1)}$
($nc^2 = \sum z_{1\alpha}^2$) and covariance matrix $\Sigma_{22 \cdot 1} = \Sigma_{22} - (1/\sigma_{11})\sigma_{(1)}\sigma_{(1)}'$. [Hint: The
conditional distribution of $Z_\alpha^{(2)}$ given $Z_{1\alpha} = z_{1\alpha}$ is $N[(1/\sigma_{11})\sigma_{(1)} z_{1\alpha}, \Sigma_{22 \cdot 1}]$.
There is an $n \times n$ orthogonal matrix $B$ which carries $(z_{11}, \ldots, z_{1n})$ into $(c, \ldots, c)$
and $(Z_{i1}, \ldots, Z_{in})$ into $(Y_{i1}, \ldots, Y_{in})$, $i = 2, \ldots, p$. Let the new $X_\alpha'$ be
$(Y_{2\alpha}, \ldots, Y_{p\alpha})$.]

4.36. (Sec. 4.4) Prove that the noncentrality parameter in the distribution in
Problem 4.35 is $(a_{11}/\sigma_{11})\bar{R}^2/(1 - \bar{R}^2)$.

4.37. (Sec. 4.4) Find the distribution of $R^2/(1 - R^2)$ by multiplying the density of
Problem 4.35 by the density of $a_{11}$ and integrating with respect to $a_{11}$.

4.38. (Sec. 4.4) Show that the density of $r^2$ derived from (38) of Section 4.2 is
identical with (42) in Section 4.4 for $p = 2$. [Hint: Use the duplication formula
for the gamma function.]

4.39. (Sec. 4.4) Prove that (30) is the uniformly most powerful test of $\bar{R} = 0$ based
on $R$. [Hint: Use the Neyman-Pearson fundamental lemma.]

4.40. (Sec. 4.4) Prove that (47) is the unique unbiased estimator of $\bar{R}^2$ based on $R^2$.

4.41. The estimates of $\mu$ and $\Sigma$ in Problem 3.1 are

$\bar{x} = (185.72 \quad 151.12 \quad 183.84 \quad 149.24)',$

$S = \begin{pmatrix} 95.2933 & 52.8683 & 69.6617 & 46.1117 \\ 52.8683 & 54.3600 & 51.3117 & 35.0533 \\ 69.6617 & 51.3117 & 100.8067 & 56.5400 \\ 46.1117 & 35.0533 & 56.5400 & 45.0233 \end{pmatrix}.$

(a) Find the estimates of the parameters of the conditional distribution of
$(x_3, x_4)$ given $(x_1, x_2)$; that is, find $S_{21}S_{11}^{-1}$ and $S_{22 \cdot 1} = S_{22} - S_{21}S_{11}^{-1}S_{12}$.

(b) Find the partial correlation $r_{34 \cdot 12}$.

(c) Use Fisher's $z$ to find a confidence interval for $\rho_{34 \cdot 12}$ with confidence 0.95.

(d) Find the sample multiple correlation coefficients between $x_3$ and $(x_1, x_2)$
and between $x_4$ and $(x_1, x_2)$.

(e) Test the hypotheses that $x_3$ is independent of $(x_1, x_2)$ and $x_4$ is
independent of $(x_1, x_2)$ at significance levels 0.05.

4.42. Let the components of $X$ correspond to scores on tests in arithmetic speed
($X_1$), arithmetic power ($X_2$), memory for words ($X_3$), memory for meaningful
symbols ($X_4$), and memory for meaningless symbols ($X_5$). The observed
correlations in a sample of 140 are [Kelley (1928)]

$\begin{pmatrix} 1.0000 & 0.4248 & 0.0420 & 0.0215 & 0.0573 \\ 0.4248 & 1.0000 & 0.1487 & 0.2489 & 0.2843 \\ 0.0420 & 0.1487 & 1.0000 & 0.6693 & 0.4662 \\ 0.0215 & 0.2489 & 0.6693 & 1.0000 & 0.6915 \\ 0.0573 & 0.2843 & 0.4662 & 0.6915 & 1.0000 \end{pmatrix}.$

(a) Find the partial correlation between $X_4$ and $X_5$, holding $X_3$ fixed.

(b) Find the partial correlation between $X_1$ and $X_2$, holding $X_3$, $X_4$, and $X_5$
fixed.

(c) Find the multiple correlation between $X_1$ and the set $X_3$, $X_4$, and $X_5$.

(d) Test the hypothesis at the 1% significance level that arithmetic speed is
independent of the three memory scores.

4.43. (Sec. 4.3) Prove that if $\rho_{ij \cdot q+1, \ldots, p} = 0$, then $\sqrt{N - 2 - (p - q)}\; r_{ij \cdot q+1, \ldots, p} /
\sqrt{1 - r_{ij \cdot q+1, \ldots, p}^2}$ is distributed according to the $t$-distribution with $N - 2 - (p - q)$
degrees of freedom.


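4.44. (Sec. 4.3) Let $X' = (X_1, X_2, X^{(2)\prime})$ have the distribution $N(\mu, \Sigma)$. The
conditional distribution of $X_1$ given $X_2 = x_2$ and $X^{(2)} = x^{(2)}$ is

$N[\mu_1 + \gamma_2 (x_2 - \mu_2) + \gamma'(x^{(2)} - \mu^{(2)}),\ \sigma_{11 \cdot 2, \ldots, p}],$

where

$\begin{pmatrix} \sigma_{22} & \sigma_{(2)}' \\ \sigma_{(2)} & \Sigma_{22} \end{pmatrix} \begin{pmatrix} \gamma_2 \\ \gamma \end{pmatrix} = \begin{pmatrix} \sigma_{12} \\ \sigma_{(1)} \end{pmatrix}.$

The estimators of $\gamma_2$ and $\gamma$ are defined by

$\begin{pmatrix} a_{22} & a_{(2)}' \\ a_{(2)} & A_{22} \end{pmatrix} \begin{pmatrix} c_2 \\ c \end{pmatrix} = \begin{pmatrix} a_{12} \\ a_{(1)} \end{pmatrix}.$

Show $c_2 = a_{12 \cdot 3, \ldots, p}/a_{22 \cdot 3, \ldots, p}$. [Hint: … substitute.]

4.45. (Sec. 4.3) In the notation of Problem 4.44, prove … [Hint: Use

$a_{11 \cdot 2, \ldots, p} = a_{11} - (c_2, c') \begin{pmatrix} a_{22} & a_{(2)}' \\ a_{(2)} & A_{22} \end{pmatrix} \begin{pmatrix} c_2 \\ c \end{pmatrix}.]$

4.46. (Sec. 4.3) Prove that $1/a_{22 \cdot 3, \ldots, p}$ is the element in the upper left-hand corner
of

$\begin{pmatrix} a_{22} & a_{(2)}' \\ a_{(2)} & A_{22} \end{pmatrix}^{-1}.$

4.47. (Sec. 4.3) Using the results in Problems 4.43-4.46, prove that the test for
$\rho_{12 \cdot 3, \ldots, p} = 0$ is equivalent to the usual $t$-test for $\gamma_2 = 0$.

4.48. Missing observations. Let $X = (Y'\ Z')'$, where $Y$ has $p$ components and $Z$ has $q$
components, be distributed according to $N(\mu, \Sigma)$, where

$\Sigma = \begin{pmatrix} \Sigma_{yy} & \Sigma_{yz} \\ \Sigma_{zy} & \Sigma_{zz} \end{pmatrix}.$

Let $M$ observations be made on $X$, and $N - M$ additional observations be made
on $Y$. Find the maximum likelihood estimates of $\mu$ and $\Sigma$. [Anderson (1957).]
[Hint: Express the likelihood function in terms of the marginal density of $Y$ and
the conditional density of $Z$ given $Y$.]

4.49. Suppose $X$ is distributed according to $N(0, \Sigma)$, where

$\Sigma = \begin{pmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{pmatrix}.$

Show that on the basis of one observation, $x' = (x_1, x_2, x_3)$, we can obtain a
confidence interval for $\rho$ (with confidence coefficient $1 - \alpha$) by using as
endpoints of the interval the solutions in $t$ of

$[x_2^2 + \chi_3^2(\alpha)]t^2 - 2(x_1 x_2 + x_2 x_3)t + x_1^2 + x_2^2 + x_3^2 - \chi_3^2(\alpha) = 0,$

where $\chi_3^2(\alpha)$ is the significance point of the $\chi^2$-distribution with three degrees
of freedom at significance level $\alpha$.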


CHAPTER 5 


The Generalized $T^2$-Statistic


5.1. INTRODUCTION 


One of the most important groups of problems in univariate statistics relates
to the mean of a given distribution when the variance of the distribution is
unknown. On the basis of a sample one may wish to decide whether the
mean is equal to a number specified in advance, or one may wish to give an
interval within which the mean lies. The statistic usually used in univariate
statistics is the difference between the mean of the sample $\bar{x}$ and the
hypothetical population mean $\mu$ divided by the sample standard deviation $s$.
If the distribution sampled is $N(\mu, \sigma^2)$, then

(1)  $t = \sqrt{N}\, \dfrac{\bar{x} - \mu}{s}$

has the well-known $t$-distribution with $N - 1$ degrees of freedom, where $N$ is
the number of observations in the sample. On the basis of this fact, one can
set up a test of the hypothesis $\mu = \mu_0$, where $\mu_0$ is specified, or one can set
up a confidence interval for the unknown parameter $\mu$.

The multivariate analog of the square of $t$ given in (1) is

(2)  $T^2 = N(\bar{x} - \mu)'S^{-1}(\bar{x} - \mu),$


where $\bar{x}$ is the mean vector of a sample of $N$, and $S$ is the sample covariance
matrix. It will be shown how this statistic can be used for testing hypotheses
about the mean vector $\mu$ of the population and for obtaining confidence
regions for the unknown $\mu$. The distribution of $T^2$ will be obtained when $\mu$
in (2) is the mean of the distribution sampled and when $\mu$ is different from
the population mean. Hotelling (1931) proposed the $T^2$-statistic for two
samples and derived the distribution when $\mu$ is the population mean.

An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc.

In Section 5.3 various uses of the $T^2$-statistic are presented, including
simultaneous confidence intervals for all linear combinations of the mean
vector. A James-Stein estimator is given when $\Sigma$ is unknown. The power
function of the $T^2$-test is treated in Section 5.4, and the multivariate
Behrens–Fisher problem in Section 5.5. In Section 5.6, optimum properties
of the $T^2$-test are considered, with regard to both invariance and
admissibility. Stein's criterion for admissibility in the general exponential family is
proved and applied. The last section is devoted to inference about the mean
in elliptically contoured distributions.


5.2. DERIVATION OF THE GENERALIZED $T^2$-STATISTIC
AND ITS DISTRIBUTION


5.2.1. Derivation of the $T^2$-Statistic As a Function of the Likelihood
Ratio Criterion


Although the $T^2$-statistic has many uses, we shall begin our discussion by
showing that the likelihood ratio test of the hypothesis $H: \mu = \mu_0$ on the
basis of a sample from $N(\mu, \Sigma)$ is based on the $T^2$-statistic given in (2) of
Section 5.1. Suppose we have $N$ observations $x_1, \ldots, x_N$ ($N > p$). The
likelihood function is

(1)  $L(\mu, \Sigma) = (2\pi)^{-\frac12 pN} |\Sigma|^{-\frac12 N} \exp\left[-\tfrac12 \sum_{\alpha=1}^{N} (x_\alpha - \mu)'\Sigma^{-1}(x_\alpha - \mu)\right].$

The observations are given; $L$ is a function of the indeterminates $\mu$, $\Sigma$. (We
shall not distinguish in notation between the indeterminates and the
parameters.) The likelihood ratio criterion is

(2)  $\lambda = \dfrac{\max_{\Sigma} L(\mu_0, \Sigma)}{\max_{\mu, \Sigma} L(\mu, \Sigma)};$

that is, the numerator is the maximum of the likelihood function for $\mu$, $\Sigma$ in
the parameter space restricted by the null hypothesis ($\mu = \mu_0$, $\Sigma$ positive
definite), and the denominator is the maximum over the entire parameter
space ($\Sigma$ positive definite). When the parameters are unrestricted, the
maximum occurs when $\mu$, $\Sigma$ are defined by the maximum likelihood estimators


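(Section 3.2) of $\mu$ and $\Sigma$,

(3)  $\hat{\mu}_\Omega = \bar{x},$

(4)  $\hat{\Sigma}_\Omega = \dfrac{1}{N} \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'.$

When $\mu = \mu_0$, the likelihood function is maximized at

(5)  $\hat{\Sigma}_\omega = \dfrac{1}{N} \sum_{\alpha=1}^{N} (x_\alpha - \mu_0)(x_\alpha - \mu_0)'$

by Lemma 3.2.2. Furthermore, by Lemma 3.2.2

(6)  $\max_{\mu, \Sigma} L(\mu, \Sigma) = \dfrac{1}{(2\pi)^{\frac12 pN} |\hat{\Sigma}_\Omega|^{\frac12 N}}\, e^{-\frac12 pN},$

(7)  $\max_{\Sigma} L(\mu_0, \Sigma) = \dfrac{1}{(2\pi)^{\frac12 pN} |\hat{\Sigma}_\omega|^{\frac12 N}}\, e^{-\frac12 pN}.$

Thus the likelihood ratio criterion is

(8)  $\lambda = \dfrac{|\hat{\Sigma}_\Omega|^{\frac12 N}}{|\hat{\Sigma}_\omega|^{\frac12 N}} = \dfrac{\left|\sum_\alpha (x_\alpha - \bar{x})(x_\alpha - \bar{x})'\right|^{\frac12 N}}{\left|\sum_\alpha (x_\alpha - \mu_0)(x_\alpha - \mu_0)'\right|^{\frac12 N}} = \dfrac{|A|^{\frac12 N}}{|A + N(\bar{x} - \mu_0)(\bar{x} - \mu_0)'|^{\frac12 N}},$

where

(9)  $A = \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' = (N - 1)S.$

Application of Corollary A.3.1 of the Appendix shows

(10)  $\lambda^{2/N} = \dfrac{|A|}{\left|A + [\sqrt{N}(\bar{x} - \mu_0)][\sqrt{N}(\bar{x} - \mu_0)]'\right|} = \dfrac{1}{1 + N(\bar{x} - \mu_0)'A^{-1}(\bar{x} - \mu_0)} = \dfrac{1}{1 + T^2/(N - 1)},$

where

(11)  $T^2 = N(\bar{x} - \mu_0)'S^{-1}(\bar{x} - \mu_0) = (N - 1)N(\bar{x} - \mu_0)'A^{-1}(\bar{x} - \mu_0).$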
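The relation between the likelihood ratio criterion and $T^2$ can be checked numerically. The following is a minimal sketch assuming NumPy; the function name `hotelling_t2` is hypothetical. It computes $T^2 = N(\bar{x} - \mu_0)'S^{-1}(\bar{x} - \mu_0)$ and, separately, $\lambda^{2/N}$ as a ratio of determinants, so that the identity $\lambda^{2/N} = 1/[1 + T^2/(N-1)]$ can be verified on any data set with $N > p$.

```python
import numpy as np

def hotelling_t2(x, mu0):
    """Return (T^2, lambda^(2/N)) for an (N x p) data matrix x and hypothesized mean mu0."""
    x = np.asarray(x, dtype=float)
    mu0 = np.asarray(mu0, dtype=float)
    n_obs = x.shape[0]
    xbar = x.mean(axis=0)
    centered = x - xbar
    a = centered.T @ centered          # A = sum (x_a - xbar)(x_a - xbar)'
    s = a / (n_obs - 1)                # S = A / (N - 1)
    diff = xbar - mu0
    t2 = float(n_obs * diff @ np.linalg.solve(s, diff))
    shifted = x - mu0
    # lambda^(2/N) = |A| / |sum (x_a - mu0)(x_a - mu0)'|
    lam = float(np.linalg.det(a) / np.linalg.det(shifted.T @ shifted))
    return t2, lam
```

For example, with `t2, lam = hotelling_t2(x, mu0)` the quantity `1 / (1 + t2 / (N - 1))` should agree with `lam` up to rounding error.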


The likelihood ratio test is defined by the critical region (region of
rejection)

(12)  $\lambda \le \lambda_0,$

where $\lambda_0$ is chosen so that the probability of (12) when the null hypothesis is
true is equal to the significance level. If we take the $\tfrac12 N$th root of both sides
of (12) and invert, subtract 1, and multiply by $N - 1$, we obtain

(13)  $T^2 \ge T_0^2,$

where

(14)  $T_0^2 = (N - 1)(\lambda_0^{-2/N} - 1).$

Theorem 5.2.1. The likelihood ratio test of the hypothesis $\mu = \mu_0$ for the
distribution $N(\mu, \Sigma)$ is given by (13), where $T^2$ is defined by (11), $\bar{x}$ is the mean
of a sample of $N$ from $N(\mu, \Sigma)$, $S$ is the covariance matrix of the sample, and $T_0^2$
is chosen so that the probability of (13) under the null hypothesis is equal to the
chosen significance level.


The Student $t$-test has the property that when testing $\mu = 0$ it is invariant
with respect to scale transformations. If the scalar random variable $X$ is
distributed according to $N(\mu, \sigma^2)$, then $X^* = cX$ is distributed according to
$N(c\mu, c^2\sigma^2)$, which is in the same class of distributions, and the hypothesis
$\mathcal{E}X = 0$ is equivalent to $\mathcal{E}X^* = \mathcal{E}cX = 0$. If the observations $x_\alpha$ are
transformed similarly ($x_\alpha^* = cx_\alpha$), then, for $c > 0$, $t^*$ computed from $x_\alpha^*$ is the
same as $t$ computed from $x_\alpha$. Thus, whatever the unit of measurement the
statistical result is the same.

The generalized $T^2$-test has a similar property. If the vector random
variable $X$ is distributed according to $N(\mu, \Sigma)$, then $X^* = CX$ (for $|C| \ne 0$) is
distributed according to $N(C\mu, C\Sigma C')$, which is in the same class of
distributions. The hypothesis $\mathcal{E}X = 0$ is equivalent to the hypothesis $\mathcal{E}X^* = \mathcal{E}CX = 0$.
If the observations $x_\alpha$ are transformed in the same way, $x_\alpha^* = Cx_\alpha$, then $T^{*2}$
computed on the basis of $x_\alpha^*$ is the same as $T^2$ computed on the basis of $x_\alpha$.
This follows from the facts that $\bar{x}^* = C\bar{x}$ and $A^* = CAC'$ and the following
lemma:

Lemma 5.2.1. For any $p \times p$ nonsingular matrices $C$ and $H$ and any
vector $k$,

(15)  $k'H^{-1}k = (Ck)'(CHC')^{-1}(Ck).$


Proof. The right-hand side of (15) is

(16)  $(Ck)'(CHC')^{-1}(Ck) = k'C'(C')^{-1}H^{-1}C^{-1}Ck = k'H^{-1}k.$ ∎

We shall show in Section 5.6 that of all tests invariant with respect to such
transformations, (13) is the uniformly most powerful.
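The invariance of $T^2$ under nonsingular linear transformations can be confirmed numerically. The following is a minimal sketch assuming NumPy; the helper names, the simulated data, and the matrix $C$ are hypothetical and serve only as an illustration. The statistic for testing a zero mean is computed from the original observations and from the transformed observations $Cx_\alpha$.

```python
import numpy as np

def t2_zero_mean(x):
    """T^2 = N xbar' S^{-1} xbar for testing that the mean vector is zero."""
    x = np.asarray(x, dtype=float)
    n_obs = x.shape[0]
    xbar = x.mean(axis=0)
    s = np.cov(x, rowvar=False)  # unbiased sample covariance matrix
    return float(n_obs * xbar @ np.linalg.solve(s, xbar))

def invariance_check(seed=0, n_obs=12, p=3):
    """Compare T^2 before and after the transformation x -> C x."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_obs, p)) + 0.5
    # Hypothetical C: diagonally dominated so that it is nonsingular here.
    c = rng.uniform(-1.0, 1.0, size=(p, p)) + 5.0 * np.eye(p)
    return t2_zero_mean(x), t2_zero_mean(x @ c.T)
```

The two returned values agree up to rounding error, as Lemma 5.2.1 implies with $k = \sqrt{N}\bar{x}$ and $H = S$.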


We can give a geometric interpretation of the $\tfrac12 N$th root of the likelihood
ratio criterion,

(17)  $\lambda^{2/N} = \dfrac{\left|\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'\right|}{\left|\sum_{\alpha=1}^{N} (x_\alpha - \mu_0)(x_\alpha - \mu_0)'\right|},$

in terms of parallelotopes. (See Section 7.5.) In the $p$-dimensional
representation the numerator of $\lambda^{2/N}$ is the sum of squares of volumes of all
parallelotopes with principal edges $p$ vectors, each with one endpoint at $\bar{x}$
and the other at an $x_\alpha$. The denominator is the sum of squares of volumes of
all parallelotopes with principal edges $p$ vectors, each with one endpoint at
$\mu_0$ and the other at $x_\alpha$. If the sum of squared volumes involving vectors
emanating from $\bar{x}$, the “center” of the $x_\alpha$, is much less than that involving
vectors emanating from $\mu_0$, then we reject the hypothesis that $\mu_0$ is the
mean of the distribution.

There is also an interpretation in the $N$-dimensional representation. Let
$y_i = (x_{i1}, \ldots, x_{iN})'$ be the $i$th vector. Then

(18)  $\sqrt{N}\bar{x}_i = \sum_{\alpha=1}^{N} \dfrac{1}{\sqrt{N}}\, x_{i\alpha}$

is the distance from the origin of the projection of $y_i$ on the equiangular line
(with direction cosines $1/\sqrt{N}, \ldots, 1/\sqrt{N}$). The coordinates of the projection
are $(\bar{x}_i, \ldots, \bar{x}_i)$. Then $(x_{i1} - \bar{x}_i, \ldots, x_{iN} - \bar{x}_i)$ is the projection of $y_i$ on the
plane through the origin perpendicular to the equiangular line. The
numerator of $\lambda^{2/N}$ is the square of the $p$-dimensional volume of the parallelotope
with principal edges, the vectors $(x_{i1} - \bar{x}_i, \ldots, x_{iN} - \bar{x}_i)$. A point
$(x_{i1} - \mu_{0i}, \ldots, x_{iN} - \mu_{0i})$ is obtained from $y_i$ by translation parallel to the
equiangular line (by a distance $\sqrt{N}\mu_{0i}$). The denominator of $\lambda^{2/N}$ is the square of the
volume of the parallelotope with principal edges these vectors. Then $\lambda^{2/N}$ is
the ratio of these squared volumes.


5.2.2. The Distribution of $T^2$


In this subsection we will find the distribution of $T^2$ under general
conditions, including the case when the null hypothesis is not true. Let $T^2 = Y'S^{-1}Y$
where $Y$ is distributed according to $N(\nu, \Sigma)$ and $nS$ is distributed
independently as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, with $Z_1, \ldots, Z_n$ independent, each with distribution
$N(0, \Sigma)$. The $T^2$ defined in Section 5.2.1 is a special case of this with
$Y = \sqrt{N}(\bar{x} - \mu_0)$ and $\nu = \sqrt{N}(\mu - \mu_0)$ and $n = N - 1$. Let $D$ be a
nonsingular matrix such that $D\Sigma D' = I$, and define

(19)  $Y^* = DY, \qquad S^* = DSD', \qquad \nu^* = D\nu.$

Then $T^2 = Y^{*\prime}S^{*-1}Y^*$ (by Lemma 5.2.1), where $Y^*$ is distributed according
to $N(\nu^*, I)$ and $nS^*$ is distributed independently as $\sum_{\alpha=1}^{n} Z_\alpha^* Z_\alpha^{*\prime} =
\sum_{\alpha=1}^{n} DZ_\alpha (DZ_\alpha)'$ with the $Z_\alpha^* = DZ_\alpha$ independent, each with distribution
$N(0, I)$. We note $\nu'\Sigma^{-1}\nu = \nu^{*\prime}(I)^{-1}\nu^* = \nu^{*\prime}\nu^*$ by Lemma 5.2.1.

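Let the first row of a $p \times p$ orthogonal matrix $Q$ be defined by

(20)  $q_{1i} = \dfrac{Y_i^*}{\sqrt{Y^{*\prime}Y^*}}, \qquad i = 1, \ldots, p;$

this is permissible because $\sum_{i=1}^{p} q_{1i}^2 = 1$. The other $p - 1$ rows can be defined
by some arbitrary rule (Lemma A.4.2 of the Appendix). Since $Q$ depends on
$Y^*$, it is a random matrix. Now let

(21)  $U = QY^*, \qquad B = QnS^*Q'.$

From the way $Q$ was defined,

(22)  $U_1 = \sum_i q_{1i} Y_i^* = \sqrt{Y^{*\prime}Y^*},$
      $U_j = \sum_i q_{ji} Y_i^* = \sqrt{Y^{*\prime}Y^*} \sum_i q_{ji} q_{1i} = 0, \qquad j \ne 1.$

Then

(23)  $\dfrac{T^2}{n} = U'B^{-1}U = (U_1, 0, \ldots, 0) \begin{pmatrix} b^{11} & b^{12} & \cdots & b^{1p} \\ b^{21} & b^{22} & \cdots & b^{2p} \\ \vdots & \vdots & & \vdots \\ b^{p1} & b^{p2} & \cdots & b^{pp} \end{pmatrix} \begin{pmatrix} U_1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} = U_1^2 b^{11},$

where $(b^{ij}) = B^{-1}$. By Theorem A.3.3 of the Appendix, $1/b^{11} = b_{11} -
b_{(1)}'B_{22}^{-1}b_{(1)} = b_{11 \cdot 2, \ldots, p}$, where

(24)  $B = \begin{pmatrix} b_{11} & b_{(1)}' \\ b_{(1)} & B_{22} \end{pmatrix},$

and $T^2/n = U_1^2/b_{11 \cdot 2, \ldots, p} = Y^{*\prime}Y^*/b_{11 \cdot 2, \ldots, p}$. The conditional distribution of
$B$ given $Q$ is that of $\sum_{\alpha=1}^{n} V_\alpha V_\alpha'$, where conditionally the $V_\alpha = QZ_\alpha^*$ are
independent, each with distribution $N(0, I)$. By Theorem 4.3.3 $b_{11 \cdot 2, \ldots, p}$ is
conditionally distributed as $\sum_{\alpha=1}^{n-(p-1)} W_\alpha^2$, where conditionally the $W_\alpha$ are
independent, each with the distribution $N(0, 1)$; that is, $b_{11 \cdot 2, \ldots, p}$ is
conditionally distributed as $\chi^2$ with $n - (p - 1)$ degrees of freedom. Since the
conditional distribution of $b_{11 \cdot 2, \ldots, p}$ does not depend on $Q$, it is
unconditionally distributed as $\chi^2$. The quantity $Y^{*\prime}Y^*$ has a noncentral $\chi^2$-distribution
with $p$ degrees of freedom and noncentrality parameter $\nu^{*\prime}\nu^* = \nu'\Sigma^{-1}\nu$.
Then $T^2/n$ is distributed as the ratio of a noncentral $\chi^2$ and an independent
$\chi^2$.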


Theorem 5.2.2. Let $T^2 = Y'S^{-1}Y$, where $Y$ is distributed according to
$N(\nu, \Sigma)$ and $nS$ is independently distributed as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$ with $Z_1, \ldots, Z_n$
independent, each with distribution $N(0, \Sigma)$. Then $(T^2/n)[(n - p + 1)/p]$ is
distributed as a noncentral $F$ with $p$ and $n - p + 1$ degrees of freedom and
noncentrality parameter $\nu'\Sigma^{-1}\nu$. If $\nu = 0$, the distribution is central $F$.

We shall call this the $T^2$-distribution with $n$ degrees of freedom.

Corollary 5.2.1. Let $x_1, \ldots, x_N$ be a sample from $N(\mu, \Sigma)$, and let $T^2 =
N(\bar{x} - \mu_0)'S^{-1}(\bar{x} - \mu_0)$. The distribution of $[T^2/(N - 1)][(N - p)/p]$ is
noncentral $F$ with $p$ and $N - p$ degrees of freedom and noncentrality parameter
$N(\mu - \mu_0)'\Sigma^{-1}(\mu - \mu_0)$. If $\mu = \mu_0$, then the $F$-distribution is central.

The above derivation of the $T^2$-distribution is due to Bowker (1960). The
noncentral $F$-density and tables of the distribution are discussed in Section
5.4.

For large samples the distribution of $T^2$ given by Corollary 5.2.1 is
approximately valid even if the parent distribution is not normal; in this sense
the $T^2$-test is a robust procedure.


Theorem 5.2.3. Let $\{X_\alpha\}$, $\alpha = 1, 2, \dots$, be a sequence of independently
identically distributed random vectors with mean vector $\mu$ and covariance matrix
$\Sigma$; let $\bar X_N = (1/N)\sum_{\alpha=1}^N X_\alpha$,
$S_N = [1/(N - 1)]\sum_{\alpha=1}^N (X_\alpha - \bar X_N)(X_\alpha - \bar X_N)'$, and
$T_N^2 = N(\bar X_N - \mu_0)'S_N^{-1}(\bar X_N - \mu_0)$. Then the limiting
distribution of $T_N^2$ as $N \to \infty$ is the $\chi^2$-distribution with $p$
degrees of freedom if $\mu = \mu_0$.


Proof. By the central limit theorem (Theorem 4.2.3) the limiting distribution
of $\sqrt{N}(\bar X_N - \mu)$ is $N(0, \Sigma)$. The sample covariance matrix converges
stochastically to $\Sigma$. Then the limiting distribution of $T_N^2$ is the
distribution of $Y'\Sigma^{-1}Y$, where $Y$ has the distribution $N(0, \Sigma)$. The
theorem follows from Theorem 3.3.3. ∎




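When the null hypothesis is true, $T^2/n$ is distributed as
$\chi_p^2/\chi_{n-p+1}^2$, and $\lambda^{2/N}$ given by (10) has the distribution of
$\chi_{n-p+1}^2/(\chi_{n-p+1}^2 + \chi_p^2)$. The density of
$V = \chi_a^2/(\chi_a^2 + \chi_b^2)$, when $\chi_a^2$ and $\chi_b^2$ are independent, is

$\frac{\Gamma[\frac12(a + b)]}{\Gamma(\frac12 a)\,\Gamma(\frac12 b)}\, v^{\frac12 a - 1}(1 - v)^{\frac12 b - 1} = \beta(v; \tfrac12 a, \tfrac12 b);$

this is the density of the beta distribution with parameters $\frac12 a$ and
$\frac12 b$ (Problem 5.27). Thus the distribution of
$\lambda^{2/N} = (1 + T^2/n)^{-1}$ is the beta distribution with parameters
$\frac12(n - p + 1)$ and $\frac12 p$.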


5.3. USES OF THE $T^2$-STATISTIC


5.3.1. Testing the Hypothesis That the Mean Vector Is a Given Vector 


The likelihood ratio test of the hypothesis $\mu = \mu_0$ on the basis of a sample of
$N$ from $N(\mu, \Sigma)$ is equivalent to

(1)  $T^2 \ge T_0^2,$

as given in Section 5.2.1. If the significance level is $\alpha$, then the
$100\alpha\%$ point of the $F$-distribution is taken, that is,

(2)  $T_0^2 = \frac{(N - 1)p}{N - p}\, F_{p,N-p}(\alpha) = T_{p,N-1}^2(\alpha),$

say. The choice of significance level may depend on the power of the test. We
shall discuss this in Section 5.4.

The statistic $T^2$ is computed from $\bar x$ and $A$. The vector
$A^{-1}(\bar x - \mu_0) = b$ is the solution of $Ab = \bar x - \mu_0$. Then
$T^2/(N - 1) = N(\bar x - \mu_0)'b$.
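
The computation just described can be sketched in a few lines. This is only an illustration in plain Python, not a routine from the text: the helper names `solve` and `hotelling_t2` and the five-observation sample are hypothetical, and the linear system $Ab = \bar x - \mu_0$ is solved by Gaussian elimination instead of forming $A^{-1}$.

```python
def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, M[i])) + [float(b[i])] for i in range(n)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(aug[r][k]))
        aug[k], aug[piv] = aug[piv], aug[k]
        for r in range(k + 1, n):
            f = aug[r][k] / aug[k][k]
            for c in range(k, n + 1):
                aug[r][c] -= f * aug[k][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def hotelling_t2(sample, mu0):
    """T^2 = N (xbar - mu0)' S^{-1} (xbar - mu0), computed through A b = xbar - mu0."""
    N, p = len(sample), len(mu0)
    xbar = [sum(x[i] for x in sample) / N for i in range(p)]
    # A is the matrix of sums of squares and cross products about the mean.
    A = [[sum((x[i] - xbar[i]) * (x[j] - xbar[j]) for x in sample)
          for j in range(p)] for i in range(p)]
    d = [xbar[i] - mu0[i] for i in range(p)]
    b = solve(A, d)
    t2_over_n = N * sum(di * bi for di, bi in zip(d, b))  # equals T^2/(N - 1)
    return (N - 1) * t2_over_n


# Hypothetical bivariate sample, used only to exercise the function.
sample = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 6.0]]
t2 = hotelling_t2(sample, [2.0, 2.0])
print(t2)
```

To carry out the test, the value `t2` would then be compared with $T_0^2$ of (2), which requires a table or routine for the $F$-distribution.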

Note that $T^2/(N - 1)$ is the nonzero root of

(3)  $|N(\bar x - \mu_0)(\bar x - \mu_0)' - \lambda A| = 0.$


Lemma 5.3.1. If $v$ is a vector of $p$ components and if $B$ is a nonsingular
$p \times p$ matrix, then $v'B^{-1}v$ is the nonzero root of

(4)  $|vv' - \lambda B| = 0.$


Proof. The nonzero root, say $\lambda_1$, of (4) is associated with a characteristic
vector $\beta$ satisfying

(5)  $vv'\beta = \lambda_1 B\beta.$


Figure 5.1. A confidence ellipse.


Since $\lambda_1 \ne 0$, $v'\beta \ne 0$. Multiplying on the left by $v'B^{-1}$, we obtain

(6)  $(v'B^{-1}v)(v'\beta) = \lambda_1 (v'\beta).$ ∎


In the case above $v = \sqrt{N}(\bar x - \mu_0)$ and $B = A$.
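
Lemma 5.3.1 can be checked numerically on a small case. The following sketch uses a hypothetical vector `v` and matrix `B` with $p = 3$; the helper names are illustrative, and the determinant is computed by Laplace expansion, which is adequate only for small $p$.

```python
def det(M):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))


def quad_form_inv(v, B):
    """v' B^{-1} v, with B^{-1} v obtained by Cramer's rule."""
    D = det(B)
    z = [det([row[:i] + [v[r]] + row[i + 1:] for r, row in enumerate(B)]) / D
         for i in range(len(v))]
    return sum(vi * zi for vi, zi in zip(v, z))


def char_det(lam, v, B):
    """The determinant |v v' - lam B| appearing in (4)."""
    p = len(v)
    return det([[v[i] * v[j] - lam * B[i][j] for j in range(p)] for i in range(p)])


# Hypothetical inputs; B is symmetric and nonsingular.
v = [1.0, 2.0, 3.0]
B = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]

lam1 = quad_form_inv(v, B)
residual = char_det(lam1, v, B)         # should vanish up to rounding
elsewhere = char_det(lam1 / 2.0, v, B)  # a value of lam that is not a root
print(lam1, residual, elsewhere)
```

Since $vv'$ has rank one, $\lambda = 0$ is a root of multiplicity $p - 1$ and `lam1` is the only other root.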


5.3.2. A Confidence Region for the Mean Vector 


If $\mu$ is the mean of $N(\mu, \Sigma)$, the probability is $1 - \alpha$ of drawing a
sample of $N$ with mean $\bar x$ and covariance matrix $S$ such that

(7)  $N(\bar x - \mu)'S^{-1}(\bar x - \mu) \le T_{p,N-1}^2(\alpha).$

Thus, if we compute (7) for a particular sample, we have confidence $1 - \alpha$
that (7) is a true statement concerning $\mu$. The inequality

(8)  $N(\bar x - m)'S^{-1}(\bar x - m) \le T_{p,N-1}^2(\alpha)$

is the interior and boundary of an ellipsoid in the $p$-dimensional space of $m$
with center at $\bar x$ and with size and shape depending on $S^{-1}$ and $\alpha$. See
Figure 5.1. We state that $\mu$ lies within this ellipsoid with confidence
$1 - \alpha$. Over random samples (8) is a random ellipsoid.


5.3.3. Simultaneous Confidence Intervals for All Linear Combinations 
of the Mean Vector 


From the confidence region (8) for $\mu$ we can obtain confidence intervals for
linear functions $\gamma'\mu$ that hold simultaneously with a given confidence
coefficient.


Lemma 5.3.2 (Generalized Cauchy-Schwarz Inequality). For a positive
definite matrix $S$,

(9)  $(\gamma'y)^2 \le \gamma'S\gamma\; y'S^{-1}y.$




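Proof. Let $b = \gamma'y/\gamma'S\gamma$. Then

(10)  $0 \le (y - bS\gamma)'S^{-1}(y - bS\gamma)$
      $= y'S^{-1}y - b\gamma'SS^{-1}y - y'S^{-1}S\gamma b + b^2\gamma'SS^{-1}S\gamma$
      $= y'S^{-1}y - \frac{(\gamma'y)^2}{\gamma'S\gamma},$

which yields (9). ∎


When $y = \bar x - \mu$, then (9) implies that

(11)  $|\gamma'(\bar x - \mu)| \le \sqrt{\gamma'S\gamma\,(\bar x - \mu)'S^{-1}(\bar x - \mu)} \le \sqrt{\gamma'S\gamma}\,\sqrt{T_{p,N-1}^2(\alpha)/N}$

holds for all $\gamma$ with probability $1 - \alpha$. Thus we can assert with confidence
$1 - \alpha$ that the unknown parameter vector satisfies simultaneously for all
$\gamma$ the inequalities

(12)  $|\gamma'\bar x - \gamma'm| \le \sqrt{\gamma'S\gamma}\,\sqrt{T_{p,N-1}^2(\alpha)/N}.$

The confidence region (8) can be explored by setting $\gamma$ in (12) equal to
simple vectors such as $(1, 0, \dots, 0)'$ to obtain $m_1$, $(1, -1, 0, \dots, 0)'$ to
yield $m_1 - m_2$, and so on. It should be noted that if only one linear function
$\gamma'\mu$ were of interest, $\sqrt{T_{p,N-1}^2(\alpha)} = \sqrt{npF_{p,n-p+1}(\alpha)/(n - p + 1)}$
would be replaced by $t_n(\alpha)$.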


5.3.4. Two-Sample Problems 


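Another situation in which the $T^2$-statistic is used is one in which the null
hypothesis is that the mean of one normal population is equal to the mean of
the other where the covariance matrices are assumed equal but unknown.
Suppose $y_1^{(i)}, \dots, y_{N_i}^{(i)}$ is a sample from $N(\mu^{(i)}, \Sigma)$, $i = 1, 2$.
We wish to test the null hypothesis $\mu^{(1)} = \mu^{(2)}$. The vector $\bar y^{(i)}$ is
distributed according to $N[\mu^{(i)}, (1/N_i)\Sigma]$. Consequently
$\sqrt{N_1N_2/(N_1 + N_2)}\,(\bar y^{(1)} - \bar y^{(2)})$ is distributed according to
$N(0, \Sigma)$ under the null hypothesis. If we let

(13)  $S = \frac{1}{N_1 + N_2 - 2}\left\{\sum_{\alpha=1}^{N_1}(y_\alpha^{(1)} - \bar y^{(1)})(y_\alpha^{(1)} - \bar y^{(1)})' + \sum_{\alpha=1}^{N_2}(y_\alpha^{(2)} - \bar y^{(2)})(y_\alpha^{(2)} - \bar y^{(2)})'\right\},$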




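then $(N_1 + N_2 - 2)S$ is distributed as $\sum_{\alpha=1}^{N_1+N_2-2} Z_\alpha Z_\alpha'$,
where $Z_\alpha$ is distributed according to $N(0, \Sigma)$. Thus

(14)  $T^2 = \frac{N_1N_2}{N_1 + N_2}\,(\bar y^{(1)} - \bar y^{(2)})'S^{-1}(\bar y^{(1)} - \bar y^{(2)})$

is distributed as $T^2$ with $N_1 + N_2 - 2$ degrees of freedom. The critical region
is

(15)  $T^2 > \frac{(N_1 + N_2 - 2)p}{N_1 + N_2 - p - 1}\, F_{p,N_1+N_2-p-1}(\alpha)$

with significance level $\alpha$.

A confidence region for $\mu^{(1)} - \mu^{(2)}$ with confidence level $1 - \alpha$ is the set
of vectors $m$ satisfying

(16)  $(\bar y^{(1)} - \bar y^{(2)} - m)'S^{-1}(\bar y^{(1)} - \bar y^{(2)} - m) \le \frac{N_1 + N_2}{N_1N_2}\, T_{p,N_1+N_2-2}^2(\alpha)$
      $= \frac{N_1 + N_2}{N_1N_2}\,\frac{(N_1 + N_2 - 2)p}{N_1 + N_2 - p - 1}\, F_{p,N_1+N_2-p-1}(\alpha).$

Simultaneous confidence intervals are

(17)  $|\gamma'(\bar y^{(1)} - \bar y^{(2)}) - \gamma'm| \le \sqrt{\gamma'S\gamma}\,\sqrt{\frac{N_1 + N_2}{N_1N_2}\, T_{p,N_1+N_2-2}^2(\alpha)}.$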


An example may be taken from Fisher (1936). Let $x_1$ = sepal length,
$x_2$ = sepal width, $x_3$ = petal length, $x_4$ = petal width. Fifty observations are
taken from the population Iris versicolor (1) and 50 from the population Iris
setosa (2). See Table 3.4. The data may be summarized (in centimeters) as

(18)  $\bar x^{(1)} = (5.936,\ 2.770,\ 4.260,\ 1.326)',$

(19)  $\bar x^{(2)} = (5.006,\ 3.428,\ 1.462,\ 0.246)',$

(20)  $98\,S = \begin{pmatrix} 19.1434 & 9.0356 & 9.7634 & 3.2394 \\ 9.0356 & 11.8658 & 4.6232 & 2.4746 \\ 9.7634 & 4.6232 & 12.2978 & 3.8794 \\ 3.2394 & 2.4746 & 3.8794 & 2.4604 \end{pmatrix}.$

The value of $T^2/98$ is 26.334, and $(T^2/98) \times \frac{95}{4} = 625.5$. This value is
highly significant compared to the $F$-value for 4 and 95 degrees of freedom of
3.52 at the 0.01 significance level.

Simultaneous confidence intervals for the differences of component means
$\mu_i^{(1)} - \mu_i^{(2)}$, $i = 1, 2, 3, 4$, are $0.930 \pm 0.337$, $-0.658 \pm 0.265$,
$2.798 \pm 0.270$, and $1.080 \pm 0.121$. In each case 0 does not lie in the interval.
[Since $t_{98}(.01) < T_{4,98}(.01)$, a univariate test on any component would lead
to rejection of the null hypothesis.] The last two components show the most
significant differences from 0.
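
The figures quoted in this example can be reproduced from the summary statistics alone. The sketch below is an illustration in plain Python, not the author's computation; it assumes that the matrix printed as (20) is $(N_1 + N_2 - 2)S = 98S$ and takes the $F$ significance point 3.52 from the text instead of computing it. The helper `solve` and the variable names are hypothetical.

```python
import math


def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, M[i])) + [float(b[i])] for i in range(n)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(aug[r][k]))
        aug[k], aug[piv] = aug[piv], aug[k]
        for r in range(k + 1, n):
            f = aug[r][k] / aug[k][k]
            for c in range(k, n + 1):
                aug[r][c] -= f * aug[k][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


# Summary statistics transcribed from (18)-(20); `pooled` is assumed to be 98 S.
xbar1 = [5.936, 2.770, 4.260, 1.326]   # Iris versicolor
xbar2 = [5.006, 3.428, 1.462, 0.246]   # Iris setosa
pooled = [[19.1434, 9.0356, 9.7634, 3.2394],
          [9.0356, 11.8658, 4.6232, 2.4746],
          [9.7634, 4.6232, 12.2978, 3.8794],
          [3.2394, 2.4746, 3.8794, 2.4604]]
N1 = N2 = 50
p = 4
n = N1 + N2 - 2

d = [a - b for a, b in zip(xbar1, xbar2)]
z = solve(pooled, d)                                  # z = (nS)^{-1} d
t2_over_n = N1 * N2 / (N1 + N2) * sum(di * zi for di, zi in zip(d, z))
f_stat = t2_over_n * (n - p + 1) / p                  # compare with F_{4,95}

f_crit = 3.52                                         # F_{4,95}(.01), as quoted
t2_crit = n * p / (n - p + 1) * f_crit                # T^2_{4,98}(.01)
scale = math.sqrt((N1 + N2) / (N1 * N2) * t2_crit)
half_widths = [math.sqrt(pooled[i][i] / n) * scale for i in range(p)]

print(round(t2_over_n, 3), round(f_stat, 1))
print([round(di, 3) for di in d], [round(h, 3) for h in half_widths])
```

Each half-width is (17) with $\gamma$ a unit coordinate vector, so the intervals are the component differences in `d` plus or minus the entries of `half_widths`.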


5.3.5. A Problem of Several Samples


After considering the above example, Fisher considers a third sample drawn
from a population assumed to have the same covariance matrix. He treats the
same measurements on 50 Iris virginica (Table 3.4). There is a theoretical
reason for believing the gene structures of these three species to be such that
the mean vectors of the three populations are related as

(21)  $3\mu^{(1)} = \mu^{(3)} + 2\mu^{(2)},$

where $\mu^{(3)}$ is the mean vector of the third population.

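This is a special case of the following general problem. Let $\{x_\alpha^{(i)}\}$,
$\alpha = 1, \dots, N_i$, $i = 1, \dots, q$, be samples from $N(\mu^{(i)}, \Sigma)$,
$i = 1, \dots, q$, respectively. Let us test the hypothesis

(22)  $H: \sum_{i=1}^q \beta_i \mu^{(i)} = \mu,$

where $\beta_1, \dots, \beta_q$ are given scalars and $\mu$ is a given vector. The
criterion is

(23)  $T^2 = c\left(\sum_{i=1}^q \beta_i \bar x^{(i)} - \mu\right)' S^{-1} \left(\sum_{i=1}^q \beta_i \bar x^{(i)} - \mu\right),$

where

(24)  $\bar x^{(i)} = \frac{1}{N_i}\sum_{\alpha=1}^{N_i} x_\alpha^{(i)},$

(25)  $\left(\sum_{i=1}^q N_i - q\right) S = \sum_{i=1}^q \sum_{\alpha=1}^{N_i} (x_\alpha^{(i)} - \bar x^{(i)})(x_\alpha^{(i)} - \bar x^{(i)})',$

(26)  $\frac{1}{c} = \sum_{i=1}^q \frac{\beta_i^2}{N_i}.$

This $T^2$ has the $T^2$-distribution with $\sum_{i=1}^q N_i - q$ degrees of freedom.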
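
As a small worked instance of (26), the constants for the hypothesis (21) can be computed directly. The sketch assumes the relation is written as $3\mu^{(1)} - 2\mu^{(2)} - \mu^{(3)} = 0$, so that $\beta = (3, -2, -1)$, with 50 observations per species and a common covariance matrix; the function name is hypothetical.

```python
def linear_hypothesis_constants(betas, sizes):
    """Return (c, degrees of freedom) for the criterion (23), using (26)."""
    inv_c = sum(b * b / N for b, N in zip(betas, sizes))   # 1/c = sum beta_i^2 / N_i
    df = sum(sizes) - len(sizes)                           # sum N_i - q
    return 1.0 / inv_c, df


# Hypothesis (21) rewritten as 3 mu1 - 2 mu2 - mu3 = 0, with N_i = 50.
c, df = linear_hypothesis_constants([3.0, -2.0, -1.0], [50, 50, 50])
print(c, df)
```

Here $1/c = (9 + 4 + 1)/50$, and with $p = 4$ the statistic $(T^2/147)(144/4)$ would be referred to the $F$-distribution with 4 and 144 degrees of freedom.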

Fisher actually assumes in his example that the covariance matrices of the
three populations may be different. Hence he uses the technique described in
Section 5.5.


5.3.6. A Problem of Symmetry 


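Consider testing the hypothesis $H: \mu_1 = \mu_2 = \cdots = \mu_p$ on the basis of a
sample $x_1, \dots, x_N$ from $N(\mu, \Sigma)$, where $\mu' = (\mu_1, \dots, \mu_p)$. Let $C$
be any $(p - 1) \times p$ matrix of rank $p - 1$ such that

(27)  $C\varepsilon = 0,$

where $\varepsilon' = (1, \dots, 1)$. Then

(28)  $y_\alpha = Cx_\alpha, \qquad \alpha = 1, \dots, N,$

has mean $C\mu$ and covariance matrix $C\Sigma C'$. The hypothesis $H$ is $C\mu = 0$.
The statistic to be used is

(29)  $T^2 = N\bar y'S^{-1}\bar y,$

where

(30)  $\bar y = \frac{1}{N}\sum_{\alpha=1}^N y_\alpha = C\bar x,$

(31)  $S = \frac{1}{N - 1}\sum_{\alpha=1}^N (y_\alpha - \bar y)(y_\alpha - \bar y)' = \frac{1}{N - 1}\, C \sum_{\alpha=1}^N (x_\alpha - \bar x)(x_\alpha - \bar x)'\, C'.$

This statistic has the $T^2$-distribution with $N - 1$ degrees of freedom for a
$(p - 1)$-dimensional distribution. This $T^2$-statistic is invariant under any
linear transformation in the $p - 1$ dimensions orthogonal to $\varepsilon$. Hence the
statistic is independent of the choice of $C$.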
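
The independence of the choice of $C$ can be illustrated numerically. In this sketch the data set and the two contrast matrices `C_a` and `C_b` are hypothetical ($p = 3$); both matrices have rank 2 and rows orthogonal to $\varepsilon$, so the two values of $T^2$ should agree up to rounding.

```python
def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, M[i])) + [float(b[i])] for i in range(n)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(aug[r][k]))
        aug[k], aug[piv] = aug[piv], aug[k]
        for r in range(k + 1, n):
            f = aug[r][k] / aug[k][k]
            for c in range(k, n + 1):
                aug[r][c] -= f * aug[k][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def t2_symmetry(sample, C):
    """T^2 = N ybar' S^{-1} ybar of (29), with y_alpha = C x_alpha."""
    N, k = len(sample), len(C)
    ys = [[sum(c * xi for c, xi in zip(row, x)) for row in C] for x in sample]
    ybar = [sum(y[i] for y in ys) / N for i in range(k)]
    S = [[sum((y[i] - ybar[i]) * (y[j] - ybar[j]) for y in ys) / (N - 1)
          for j in range(k)] for i in range(k)]
    z = solve(S, ybar)
    return N * sum(a * b for a, b in zip(ybar, z))


# Hypothetical trivariate observations.
sample = [[1.0, 2.0, 4.0], [2.0, 1.0, 3.0], [3.0, 5.0, 5.0],
          [4.0, 3.0, 7.0], [5.0, 6.0, 4.0]]
C_a = [[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]]    # successive differences
C_b = [[1.0, 0.0, -1.0], [1.0, -2.0, 1.0]]    # another full-rank set of contrasts

t2_a = t2_symmetry(sample, C_a)
t2_b = t2_symmetry(sample, C_b)
print(t2_a, t2_b)
```

The agreement follows because `C_b` equals a nonsingular $2 \times 2$ matrix times `C_a`.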

An example of this sort has been given by Rao (1948b). Let $N$ be the
amount of cork in a boring from the north into a cork tree; let $E$, $S$, and $W$
be defined similarly. The set of amounts in four borings on one tree is
considered as an observation from a 4-variate normal distribution. The
question is whether the cork trees have the same amount of cork on each
side. We make a transformation

(32)  $y_1 = N - E - W + S,$
      $y_2 = S - W,$
      $y_3 = N - S.$

The number of observations is 28. The vector of means is

(33)  $\bar y = (8.86,\ 4.50,\ 0.86)';$

the covariance matrix for $y$ is

(34)  $S = \begin{pmatrix} 128.72 & 61.41 & -21.02 \\ 61.41 & 56.93 & -28.30 \\ -21.02 & -28.30 & 63.53 \end{pmatrix}.$

The value of $T^2/(N - 1)$ is 0.768. The statistic $0.768 \times 25/3 = 6.402$ is to be
compared with the $F$-significance point with 3 and 25 degrees of freedom. It
is significant at the 1% level.


5.3.7. Improved Estimation of the Mean 


In Section 3.5 we considered estimation of the mean when the covariance
matrix was known and showed that the Stein-type estimation based on this
knowledge yielded lower quadratic risks than did the sample mean. In
particular, if the loss is $(m - \mu)'\Sigma^{-1}(m - \mu)$, then

(35)  $\left(1 - \frac{p - 2}{N(\bar x - \nu)'\Sigma^{-1}(\bar x - \nu)}\right)(\bar x - \nu) + \nu$

is a minimax estimator of $\mu$ for any $\nu$ and has a smaller risk than $\bar x$ when
$p \ge 3$. When $\Sigma$ is unknown, we consider replacing it by an estimator, namely,
a multiple of $A = nS$.


Theorem 5.3.1. When the loss is $(m - \mu)'\Sigma^{-1}(m - \mu)$, the estimator for
$p \ge 3$ given by

(36)  $\left(1 - \frac{a}{N(\bar x - \nu)'A^{-1}(\bar x - \nu)}\right)(\bar x - \nu) + \nu$

has smaller risk than $\bar x$ and is minimax for $0 < a < 2(p - 2)/(n - p + 3)$, and
the risk is minimized for $a = (p - 2)/(n - p + 3)$.




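Proof. As in the case when $\Sigma$ is known (Section 3.5.2), we can make a
transformation that carries $(1/N)\Sigma$ to $I$. Then the problem is to estimate $\mu$
based on $Y$ with the distribution $N(\mu, I)$ and $A = \sum_{\alpha=1}^n Z_\alpha Z_\alpha'$, where
$Z_1, \dots, Z_n$ are independently distributed, each according to $N(0, I)$, and the
loss function is $(m - \mu)'(m - \mu)$. (We have dropped a factor of $N$.) The
difference in risks is

(37)  $\Delta R(\mu) = \mathcal{E}_\mu\left\{\|Y - \mu\|^2 - \left\|\left(1 - \frac{a}{(Y - \nu)'A^{-1}(Y - \nu)}\right)(Y - \nu) + \nu - \mu\right\|^2\right\}$
      $= \mathcal{E}_\mu\left\{\frac{2a}{(Y - \nu)'A^{-1}(Y - \nu)}\sum_{i=1}^p (Y_i - \mu_i)(Y_i - \nu_i) - \frac{a^2\|Y - \nu\|^2}{[(Y - \nu)'A^{-1}(Y - \nu)]^2}\right\}.$

The proof of Theorem 5.2.2 shows that $(Y - \nu)'A^{-1}(Y - \nu)$ is distributed as
$\|Y - \nu\|^2/\chi_{n-p+1}^2$, where the $\chi_{n-p+1}^2$ is independent of $Y$. Then the
difference in risks is

(38)  $\Delta R(\mu) = \mathcal{E}_\mu\left\{\frac{2a\chi_{n-p+1}^2}{\|Y - \nu\|^2}\sum_{i=1}^p (Y_i - \mu_i)(Y_i - \nu_i) - \frac{a^2(\chi_{n-p+1}^2)^2}{\|Y - \nu\|^2}\right\}$
      $= \mathcal{E}_\mu\left\{\frac{2a(p - 2)\chi_{n-p+1}^2}{\|Y - \nu\|^2} - \frac{a^2(\chi_{n-p+1}^2)^2}{\|Y - \nu\|^2}\right\}$
      $= \left\{2(p - 2)(n - p + 1)a - [2(n - p + 1) + (n - p + 1)^2]a^2\right\}\mathcal{E}_\mu\frac{1}{\|Y - \nu\|^2}.$

The factor in braces is $n - p + 1$ times $2(p - 2)a - (n - p + 3)a^2$, which
is positive for $0 < a < 2(p - 2)/(n - p + 3)$ and is maximized for
$a = (p - 2)/(n - p + 3)$. ∎


The improvement over the risk of $\bar x$ is
$[(n - p + 1)(p - 2)^2/(n - p + 3)]\,\mathcal{E}_\mu\|Y - \nu\|^{-2}$, as compared to the
improvement $(p - 2)^2\,\mathcal{E}_\mu\|Y - \nu\|^{-2}$ of $m(y)$ of Section 3.5 when $\Sigma$ is
known.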




Corollary 5.3.1. The estimator for $p \ge 3$

(39)  $\left(1 - \frac{a}{N(\bar x - \nu)'A^{-1}(\bar x - \nu)}\right)^{+}(\bar x - \nu) + \nu$

has smaller risk than (36) and is minimax for $0 < a < 2(p - 2)/(n - p + 3)$.


Proof. This corollary follows from Theorem 5.3.1 and Lemma 3.5.2. ∎


The risk of (39) is not necessarily minimized at $a = (p - 2)/(n - p + 3)$,
but that value seems like a good choice. This is the estimator (18) of Section
3.5 with $\Sigma$ replaced by $[1/(n - p + 3)]A$.
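
The estimators (36) and (39) are simple to evaluate once $\bar x$, $\nu$, and $A$ are available. The sketch below is illustrative only: the function name, its arguments, and the numerical inputs are hypothetical, and it assumes $A = nS$ with $n = N - 1$ degrees of freedom from a single sample of size $N$.

```python
def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, M[i])) + [float(b[i])] for i in range(n)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(aug[r][k]))
        aug[k], aug[piv] = aug[piv], aug[k]
        for r in range(k + 1, n):
            f = aug[r][k] / aug[k][k]
            for c in range(k, n + 1):
                aug[r][c] -= f * aug[k][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def shrinkage_estimate(xbar, nu, A, N, a=None, positive_part=False):
    """Estimator (36), or (39) when positive_part is True.

    Assumes n = N - 1 and xbar != nu (otherwise the quadratic form is zero).
    """
    p, n = len(xbar), N - 1
    if a is None:
        a = (p - 2) / (n - p + 3)           # the value suggested in the text
    d = [xi - vi for xi, vi in zip(xbar, nu)]
    q = N * sum(di * zi for di, zi in zip(d, solve(A, d)))  # N d' A^{-1} d
    factor = 1.0 - a / q
    if positive_part:
        factor = max(factor, 0.0)
    return [factor * di + vi for di, vi in zip(d, nu)]


# Hypothetical inputs: p = 3, N = 11, A = 10 I, shrinking toward nu = 0.
A = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
nu = [0.0, 0.0, 0.0]
far = shrinkage_estimate([1.0, 0.0, 0.0], nu, A, N=11)
near_plain = shrinkage_estimate([0.05, 0.0, 0.0], nu, A, N=11)
near_plus = shrinkage_estimate([0.05, 0.0, 0.0], nu, A, N=11, positive_part=True)
print(far, near_plain, near_plus)
```

When $\bar x$ is very close to $\nu$, the factor in (36) becomes negative and the estimate overshoots $\nu$; the positive-part version (39) returns $\nu$ in that case.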

When the loss function is $(m - \mu)'Q(m - \mu)$, where $Q$ is an arbitrary
positive definite matrix, it is harder to present a uniformly improved
estimator that is attractive. The estimators of Section 3.5 can be used with
$\Sigma$ replaced by an estimate.


5.4. THE DISTRIBUTION OF $T^2$ UNDER ALTERNATIVE
HYPOTHESES; THE POWER FUNCTION


In Section 5.2.2 we showed that $(T^2/n)(N - p)/p$ has a noncentral
$F$-distribution. In this section we shall discuss the noncentral
$F$-distribution, its tabulation, and applications to procedures based on $T^2$.

The noncentral $F$-distribution is defined as the distribution of the ratio of
a noncentral $\chi^2$ and an independent $\chi^2$ divided by the ratio of
corresponding degrees of freedom. Let $V$ have the noncentral $\chi^2$-distribution
with $p$ degrees of freedom and noncentrality parameter $\tau^2$ (as given in Theorem
3.3.5), and let $W$ be independently distributed as $\chi^2$ with $m$ degrees of
freedom. We shall find the density of $F = (V/p)/(W/m)$, which is the
noncentral $F$ with noncentrality parameter $\tau^2$. The joint density of $V$ and $W$
is (28) of Section 3.3 multiplied by the density of $W$, which is
$2^{-\frac12 m}\,\Gamma^{-1}(\frac12 m)\, w^{\frac12 m - 1} e^{-\frac12 w}$. The joint density of
$F$ and $W$ ($dv = pw\,df/m$) is

(1)  $\frac{e^{-\frac12\tau^2}}{2^{\frac12(p+m)}\,\Gamma(\frac12 m)}\,\frac{p}{m}\, e^{-\frac12 w(1 + pf/m)} \sum_{\beta=0}^\infty \frac{(\tau^2/4)^\beta\,(pf/m)^{\frac12 p + \beta - 1}\, w^{\frac12(p+m) + \beta - 1}}{\beta!\,\Gamma(\frac12 p + \beta)}.$

The marginal density, obtained by integrating (1) with respect to $w$ from 0
to $\infty$, is

(2)  $\frac{p\, e^{-\frac12\tau^2}}{m\,\Gamma(\frac12 m)} \sum_{\beta=0}^\infty \frac{(\tau^2/2)^\beta\,(pf/m)^{\frac12 p + \beta - 1}\,\Gamma[\frac12(p + m) + \beta]}{\beta!\,\Gamma(\frac12 p + \beta)\,(1 + pf/m)^{\frac12(p+m) + \beta}}.$



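Theorem 5.4.1. If $V$ has a noncentral $\chi^2$-distribution with $p$ degrees of
freedom and noncentrality parameter $\tau^2$, and $W$ has an independent
$\chi^2$-distribution with $m$ degrees of freedom, then $F = (V/p)/(W/m)$ has the
density (2).


The density (2) is the density of the noncentral $F$-distribution.

If $T^2 = N(\bar x - \mu_0)'S^{-1}(\bar x - \mu_0)$ is based on a sample of $N$ from
$N(\mu, \Sigma)$, then $(T^2/n)(N - p)/p$ has the noncentral $F$-distribution with $p$
and $N - p$ degrees of freedom and noncentrality parameter
$N(\mu - \mu_0)'\Sigma^{-1}(\mu - \mu_0) = \tau^2$. From (2) we find that the density of $T^2$ is

(3)  $\frac{e^{-\frac12\tau^2}}{(N - 1)\,\Gamma[\frac12(N - p)]} \sum_{\beta=0}^\infty \frac{(\tau^2/2)^\beta\,[t^2/(N - 1)]^{\frac12 p + \beta - 1}\,\Gamma(\frac12 N + \beta)}{\beta!\,\Gamma(\frac12 p + \beta)\,[1 + t^2/(N - 1)]^{\frac12 N + \beta}}$
     $= \frac{e^{-\frac12\tau^2}\,[t^2/(N - 1)]^{\frac12 p - 1}\,\Gamma(\frac12 N)}{(N - 1)\,\Gamma[\frac12(N - p)]\,\Gamma(\frac12 p)\,[1 + t^2/(N - 1)]^{\frac12 N}}\; {}_1F_1\!\left(\tfrac12 N;\ \tfrac12 p;\ \frac{\tau^2\, t^2/(N - 1)}{2[1 + t^2/(N - 1)]}\right),$

where

(4)  ${}_1F_1(a; b; x) = \sum_{\beta=0}^\infty \frac{\Gamma(a + \beta)\,\Gamma(b)\, x^\beta}{\Gamma(a)\,\Gamma(b + \beta)\,\beta!}.$

The density (3) is the density of the noncentral $T^2$-distribution.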

Tables have been given by Tang (1938) of the probability of accepting the
null hypothesis (that is, the probability of Type II error) for various values of
$\tau^2$ and for significance levels 0.05 and 0.01. His number of degrees of
freedom $f_1$ is our $p$ [1(1)8], his $f_2$ is our $n - p + 1$ [2, 4(1)30, 60, $\infty$],
and his noncentrality parameter $\phi$ is related to our $\tau^2$ by

(5)  $\phi = \frac{\tau}{\sqrt{p + 1}}$

[1($\frac12$)3(1)8]. His accompanying tables of significance points are for
$T^2/(T^2 + N - 1)$.

As an example, suppose $p = 4$, $n - p + 1 = 20$, and consider testing the
null hypothesis $\mu = 0$ at the 1% level of significance. We would like to know
the probability, say, that we accept the null hypothesis when $\phi = 2.5$
($\tau^2 = 31.25$). It is 0.227. If we think the disadvantage of accepting the null
hypothesis when $N$, $\mu$, and $\Sigma$ are such that $\tau^2 = 31.25$ is less than the
disadvantage of rejecting the null hypothesis when it is true, then we may find
it reasonable to conduct the test as assumed. However, if the disadvantage of
one type of error is about equal to that of the other, it would seem reasonable
to bring down the probability of a Type II error. Thus, if we use a
significance level of 5%, the probability of Type II error (for $\phi = 2.5$) is
only 0.043.
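
Where tables are not at hand, the probability of Type II error can be computed from the series form of the noncentral $F$-distribution. The sketch below is a plain-Python illustration under a restrictive assumption: both $p$ and $m$ are even, so that each term of the series is an incomplete beta function with integer arguments and can be written as a binomial tail sum. The function names and the illustrative parameter values ($p = 2$, $m = 10$) are hypothetical and are not taken from Tang's tables.

```python
import math


def binom_tail(k, a, x):
    """P(Bin(k, x) >= a); for integers this equals I_x(a, k - a + 1)."""
    return sum(math.comb(k, j) * x ** j * (1.0 - x) ** (k - j)
               for j in range(a, k + 1))


def ncf_cdf(f, p, m, tau2, terms=200):
    """P(F <= f) for the noncentral F with even p and m, noncentrality tau2.

    Poisson mixture over beta of I_x(p/2 + beta, m/2), x = pf/(m + pf).
    """
    x = p * f / (m + p * f)
    a0, b = p // 2, m // 2
    weight = math.exp(-tau2 / 2.0)          # Poisson weight for beta = 0
    total = 0.0
    for beta in range(terms):
        a = a0 + beta
        total += weight * binom_tail(a + b - 1, a, x)
        weight *= (tau2 / 2.0) / (beta + 1)
    return total


def f_quantile(prob, p, m):
    """Central F quantile by bisection on the central distribution function."""
    lo, hi = 0.0, 1000.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if ncf_cdf(mid, p, m, 0.0, terms=1) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def type2_error(alpha, p, m, tau2):
    """Probability of accepting the null hypothesis at level alpha."""
    return ncf_cdf(f_quantile(1.0 - alpha, p, m), p, m, tau2)


# Illustrative values only: p = 2, m = 10.
for tau2 in (0.0, 5.0, 10.0):
    print(tau2, type2_error(0.01, 2, 10, tau2))
```

At $\tau^2 = 0$ the function returns $1 - \alpha$, and the returned probability decreases as $\tau^2$ grows, which is the behavior the tables describe.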

Lehmer (1944) has computed tables of $\phi$ for given significance level and
given probability of Type II error. Here tables can be used to see what value
of $\tau^2$ is needed to make the probability of acceptance of the null hypothesis
sufficiently low when $\mu \ne 0$. For instance, if we want to be able to reject the
hypothesis $\mu = 0$ on the basis of a sample for a given $\mu$ and $\Sigma$, we may be
able to choose $N$ so that $N\mu'\Sigma^{-1}\mu = \tau^2$ is sufficiently large. Of course, the
difficulty with these considerations is that we usually do not know exactly the
values of $\mu$ and $\Sigma$ (and hence of $\tau^2$) for which we want the probability of
rejection at a certain value.

The distribution of $T^2$ when the null hypothesis is not true was derived by
different methods by Hsu (1938) and Bose and Roy (1938).


5.5. THE TWO-SAMPLE PROBLEM WITH UNEQUAL 
COVARIANCE MATRICES 


If the covariance matrices are not the same, the $T^2$-test for equality of mean
vectors has a probability of rejection under the null hypothesis that depends
on these matrices. If the difference between the matrices is small or if the
sample sizes are large, there is no practical effect. However, if the covariance
matrices are quite different and/or the sample sizes are relatively small, the
nominal significance level may be distorted. Hence we develop a procedure
with assigned significance level. Let $\{x_\alpha^{(i)}\}$, $\alpha = 1, \dots, N_i$, be samples
from $N(\mu^{(i)}, \Sigma_i)$, $i = 1, 2$. We wish to test the hypothesis
$H: \mu^{(1)} = \mu^{(2)}$. The mean $\bar x^{(1)}$ of the first sample is normally
distributed with expected value

(1)  $\mathcal{E}\bar x^{(1)} = \mu^{(1)}$

and covariance matrix

(2)  $\mathcal{E}(\bar x^{(1)} - \mu^{(1)})(\bar x^{(1)} - \mu^{(1)})' = \frac{1}{N_1}\Sigma_1.$

Similarly, the mean $\bar x^{(2)}$ of the second sample is normally distributed with
expected value

(3)  $\mathcal{E}\bar x^{(2)} = \mu^{(2)}$

and covariance matrix

(4)  $\mathcal{E}(\bar x^{(2)} - \mu^{(2)})(\bar x^{(2)} - \mu^{(2)})' = \frac{1}{N_2}\Sigma_2.$

Thus $\bar x^{(1)} - \bar x^{(2)}$ has mean $\mu^{(1)} - \mu^{(2)}$ and covariance matrix
$(1/N_1)\Sigma_1 + (1/N_2)\Sigma_2$. We cannot use the technique of Section 5.2,
however, because

(5)  $\sum_{\alpha=1}^{N_1}(x_\alpha^{(1)} - \bar x^{(1)})(x_\alpha^{(1)} - \bar x^{(1)})' + \sum_{\alpha=1}^{N_2}(x_\alpha^{(2)} - \bar x^{(2)})(x_\alpha^{(2)} - \bar x^{(2)})'$

does not have the Wishart distribution with covariance matrix a multiple of
$(1/N_1)\Sigma_1 + (1/N_2)\Sigma_2$.

If $N_1 = N_2 = N$, say, we can use the $T^2$-test in an obvious way. Let
$y_\alpha = x_\alpha^{(1)} - x_\alpha^{(2)}$ (assuming the numbering of the observations in the
two samples is independent of the observations themselves). Then $y_\alpha$ is
normally distributed with mean $\mu^{(1)} - \mu^{(2)}$ and covariance matrix
$\Sigma_1 + \Sigma_2$, and $y_1, \dots, y_N$ are independent. Let
$\bar y = (1/N)\sum_{\alpha=1}^N y_\alpha = \bar x^{(1)} - \bar x^{(2)}$, and define $S$ by

(6)  $(N - 1)S = \sum_{\alpha=1}^N (y_\alpha - \bar y)(y_\alpha - \bar y)'$
     $= \sum_{\alpha=1}^N (x_\alpha^{(1)} - x_\alpha^{(2)} - \bar x^{(1)} + \bar x^{(2)})(x_\alpha^{(1)} - x_\alpha^{(2)} - \bar x^{(1)} + \bar x^{(2)})'.$

Then

(7)  $T^2 = N\bar y'S^{-1}\bar y$

is suitable for testing the hypothesis $\mu^{(1)} - \mu^{(2)} = 0$, and has the
$T^2$-distribution with $N - 1$ degrees of freedom. It should be observed that if
we had known $\Sigma_1 = \Sigma_2$, we would have used a $T^2$-statistic with $2N - 2$
degrees of freedom; thus we have lost $N - 1$ degrees of freedom in constructing
a test which is independent of the two covariance matrices. If
$N_1 = N_2 = 50$ as in the example in Section 5.3.4, then
$T_{4,49}^2(.01) = 15.93$ as compared to $T_{4,98}^2(.01) = 14.52$.


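Now let us turn our attention to the case of $N_1 \ne N_2$. For convenience, let
$N_1 < N_2$. Then we define

(8)  $y_\alpha = x_\alpha^{(1)} - \sqrt{\frac{N_1}{N_2}}\, x_\alpha^{(2)} + \frac{1}{\sqrt{N_1N_2}}\sum_{\beta=1}^{N_1} x_\beta^{(2)} - \frac{1}{N_2}\sum_{\gamma=1}^{N_2} x_\gamma^{(2)}, \qquad \alpha = 1, \dots, N_1.$

The expected value of $y_\alpha$ is

(9)  $\mathcal{E}y_\alpha = \mu^{(1)} - \sqrt{\frac{N_1}{N_2}}\,\mu^{(2)} + \frac{N_1}{\sqrt{N_1N_2}}\,\mu^{(2)} - \frac{N_2}{N_2}\,\mu^{(2)} = \mu^{(1)} - \mu^{(2)}.$

The covariance matrix of $y_\alpha$ and $y_\beta$ is

(10)  $\mathcal{E}(y_\alpha - \mathcal{E}y_\alpha)(y_\beta - \mathcal{E}y_\beta)' = \delta_{\alpha\beta}\left[\Sigma_1 + \frac{N_1}{N_2}\Sigma_2\right].$

Thus a suitable statistic for testing $\mu^{(1)} - \mu^{(2)} = 0$, which has the
$T^2$-distribution with $N_1 - 1$ degrees of freedom, is

(11)  $T^2 = N_1\bar y'S^{-1}\bar y,$

where

(12)  $\bar y = \frac{1}{N_1}\sum_{\alpha=1}^{N_1} y_\alpha = \bar x^{(1)} - \bar x^{(2)}$

and

(13)  $(N_1 - 1)S = \sum_{\alpha=1}^{N_1}(y_\alpha - \bar y)(y_\alpha - \bar y)' = \sum_{\alpha=1}^{N_1}(u_\alpha - \bar u)(u_\alpha - \bar u)',$

where $\bar u = (1/N_1)\sum_{\alpha=1}^{N_1} u_\alpha$ and
$u_\alpha = x_\alpha^{(1)} - \sqrt{N_1/N_2}\, x_\alpha^{(2)}$, $\alpha = 1, \dots, N_1$.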
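
The transformation (8) and the two identities (12) and (13) can be checked on a small example. This is a minimal sketch with hypothetical data ($N_1 = 3$, $N_2 = 5$, $p = 2$); the function names are illustrative, and the first sample is assumed to be the smaller one.

```python
import math


def scheffe_transform(x1, x2):
    """Return y_1, ..., y_{N1} of (8); requires len(x1) <= len(x2)."""
    N1, N2, p = len(x1), len(x2), len(x1[0])
    if N1 > N2:
        raise ValueError("label the samples so that N1 <= N2")
    r = math.sqrt(N1 / N2)
    head = [sum(x[i] for x in x2[:N1]) for i in range(p)]   # first N1 of sample 2
    total = [sum(x[i] for x in x2) for i in range(p)]       # all of sample 2
    return [[x1[a][i] - r * x2[a][i]
             + head[i] / math.sqrt(N1 * N2) - total[i] / N2
             for i in range(p)] for a in range(N1)]


def mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


# Hypothetical samples.
x1 = [[1.0, 2.0], [2.0, 0.5], [4.0, 3.0]]
x2 = [[0.5, 1.0], [1.5, 2.5], [3.0, 0.0], [2.0, 2.0], [1.0, 4.0]]

ys = scheffe_transform(x1, x2)
ybar = mean(ys)
diff_of_means = [a - b for a, b in zip(mean(x1), mean(x2))]

# u_alpha of (13): the deviations u_alpha - ubar equal y_alpha - ybar.
r = math.sqrt(len(x1) / len(x2))
us = [[a - r * b for a, b in zip(x1[k], x2[k])] for k in range(len(x1))]
ubar = mean(us)
max_gap = max(abs((ys[k][i] - ybar[i]) - (us[k][i] - ubar[i]))
              for k in range(len(x1)) for i in range(2))
print(ybar, diff_of_means, max_gap)
```

Because the deviations agree, $S$ may be computed from the simpler $u_\alpha$, and $T^2 = N_1\bar y'S^{-1}\bar y$ then follows as in (11).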

This procedure was suggested by Scheffé (1943) in the univariate case.
Scheffé showed that in the univariate case this technique gives the shortest
confidence intervals obtained by using the $t$-distribution. The advantage of
the method is that $\bar x^{(1)} - \bar x^{(2)}$ is used, and this statistic is most
relevant to $\mu^{(1)} - \mu^{(2)}$. The sacrifice of observations in estimating a
covariance matrix is not so important. Bennett (1951) gave the extension of the
procedure to the multivariate case.

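This approach can be used for more general cases. Let $\{x_\alpha^{(i)}\}$,
$\alpha = 1, \dots, N_i$, $i = 1, \dots, q$, be samples from $N(\mu^{(i)}, \Sigma_i)$,
$i = 1, \dots, q$, respectively. Consider testing the hypothesis

(14)  $H: \sum_{i=1}^q \beta_i\mu^{(i)} = \mu,$

where $\beta_1, \dots, \beta_q$ are given scalars and $\mu$ is a given vector. If the $N_i$
are unequal, take $N_1$ to be the smallest. Let

(15)  $y_\alpha = \beta_1 x_\alpha^{(1)} + \sum_{i=2}^q \beta_i\sqrt{\frac{N_1}{N_i}}\left(x_\alpha^{(i)} - \frac{1}{N_1}\sum_{\beta=1}^{N_1} x_\beta^{(i)} + \frac{1}{\sqrt{N_1N_i}}\sum_{\gamma=1}^{N_i} x_\gamma^{(i)}\right).$

Then $\mathcal{E}y_\alpha = \sum_{i=1}^q \beta_i\mu^{(i)}$, and

(16)  $\mathcal{E}(y_\alpha - \mathcal{E}y_\alpha)(y_\beta - \mathcal{E}y_\beta)' = \delta_{\alpha\beta}\sum_{i=1}^q \frac{\beta_i^2 N_1}{N_i}\,\Sigma_i.$

Let $\bar y$ and $S$ be defined by

(17)  $\bar y = \frac{1}{N_1}\sum_{\alpha=1}^{N_1} y_\alpha = \sum_{i=1}^q \beta_i\bar x^{(i)},$

(18)  $(N_1 - 1)S = \sum_{\alpha=1}^{N_1}(y_\alpha - \bar y)(y_\alpha - \bar y)'.$

Then

(19)  $T^2 = N_1(\bar y - \mu)'S^{-1}(\bar y - \mu)$

is suitable for testing $H$, and when the hypothesis is true, this statistic has the
$T^2$-distribution for dimension $p$ with $N_1 - 1$ degrees of freedom. If we let
$u_\alpha = \sum_{i=1}^q \beta_i\sqrt{N_1/N_i}\, x_\alpha^{(i)}$, $\alpha = 1, \dots, N_1$, then $S$ can
be defined as

(20)  $(N_1 - 1)S = \sum_{\alpha=1}^{N_1}(u_\alpha - \bar u)(u_\alpha - \bar u)'.$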


Another problem that is amenable to this kind of treatment is testing the
hypothesis that two subvectors have equal means. Let
$x = (x^{(1)\prime}, x^{(2)\prime})'$ be distributed normally with mean
$\mu = (\mu^{(1)\prime}, \mu^{(2)\prime})'$ and covariance matrix

(21)  $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$

We assume that $x^{(1)}$ and $x^{(2)}$ are each of $q$ components. Then
$y = x^{(1)} - x^{(2)}$ is distributed normally with mean $\mu^{(1)} - \mu^{(2)}$ and
covariance matrix $\Sigma_y = \Sigma_{11} - \Sigma_{21} - \Sigma_{12} + \Sigma_{22}$. To test the
hypothesis $\mu^{(1)} = \mu^{(2)}$ we use a $T^2$-statistic $N\bar y'S_y^{-1}\bar y$, where the
mean vector and covariance matrix of the sample are partitioned similarly to
$\mu$ and $\Sigma$.


5.6. SOME OPTIMAL PROPERTIES OF THE $T^2$-TEST


5.6.1. Optimal Invariant Tests 


In this section we shall indicate that the $T^2$-test is the best in certain classes
of tests and sketch briefly the proofs of these results.

The hypothesis $\mu = 0$ is to be tested on the basis of the $N$ observations
$x_1, \dots, x_N$ from $N(\mu, \Sigma)$. First we consider the class of tests based on the
statistics $A = \sum(x_\alpha - \bar x)(x_\alpha - \bar x)'$ and $\bar x$ which are invariant with
respect to the transformations $A^* = CAC'$ and $\bar x^* = C\bar x$, where $C$ is
nonsingular. The transformation $x_\alpha^* = Cx_\alpha$ leaves the problem invariant;
that is, in terms of $x_\alpha^*$ we test the hypothesis $\mathcal{E}x_\alpha^* = 0$ given that
$x_1^*, \dots, x_N^*$ are $N$ observations from a multivariate normal population. It
seems reasonable that we require a solution that is also invariant with respect
to these transformations; that is, we look for a critical region that is not
changed by a nonsingular linear transformation. (The definition of the region
is the same in different coordinate systems.)


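Theorem 5.6.1. Given the observations $x_1, \dots, x_N$ from $N(\mu, \Sigma)$, of all
tests of $\mu = 0$ based on $\bar x$ and $A = \sum(x_\alpha - \bar x)(x_\alpha - \bar x)'$ that are
invariant with respect to transformations $\bar x^* = C\bar x$, $A^* = CAC'$ ($C$
nonsingular), the $T^2$-test is uniformly most powerful.


Proof. First, as we have seen in Section 5.2.1, any test based on $T^2$ is
invariant. Second, this function is essentially the only invariant, for if
$f(\bar x, A)$ is invariant, then $f(\bar x, A) = f(\bar x^*, I)$, where only the first
coordinate of $\bar x^*$ is different from zero and it is $\sqrt{\bar x'A^{-1}\bar x}$. (There
is a matrix $C$ such that $C\bar x = \bar x^*$ and $CAC' = I$.) Thus $f(\bar x, A)$ depends
only on $\bar x'A^{-1}\bar x$. Thus an invariant test must be based on $\bar x'A^{-1}\bar x$.
Third, we can apply the Neyman-Pearson fundamental lemma to the distribution
of $T^2$ [(3) of Section 5.4] to find the uniformly most powerful test based on
$T^2$ against a simple alternative $\tau^2 = N\mu'\Sigma^{-1}\mu$. The most powerful test of
$\tau^2 = 0$ is based on the ratio of (3) of Section 5.4 to (3) with $\tau^2 = 0$. The
critical region is

(1)  $c \le \frac{e^{-\frac12\tau^2}}{n\,\Gamma[\frac12(n - p + 1)]}\sum_{\alpha=0}^\infty \frac{(\tau^2/2)^\alpha\,(t^2/n)^{\frac12 p + \alpha - 1}\,\Gamma[\frac12(n + 1) + \alpha]}{\alpha!\,\Gamma(\frac12 p + \alpha)\,(1 + t^2/n)^{\frac12(n+1) + \alpha}} \cdot \frac{n\,\Gamma[\frac12(n - p + 1)]\,\Gamma(\frac12 p)}{(t^2/n)^{\frac12 p - 1}(1 + t^2/n)^{-\frac12(n+1)}\,\Gamma[\frac12(n + 1)]}$
     $= \frac{\Gamma(\frac12 p)\, e^{-\frac12\tau^2}}{\Gamma[\frac12(n + 1)]}\sum_{\alpha=0}^\infty \frac{(\tau^2/2)^\alpha\,\Gamma[\frac12(n + 1) + \alpha]}{\alpha!\,\Gamma(\frac12 p + \alpha)}\left(\frac{t^2/n}{1 + t^2/n}\right)^\alpha.$

The right-hand side of (1) is a strictly increasing function of
$(t^2/n)/(1 + t^2/n)$, hence of $t^2$. Thus the inequality is equivalent to
$t^2 > k$ for $k$ suitably chosen. Since this does not depend on the alternative
$\tau^2$, the test is uniformly most powerful invariant. ∎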




Definition 5.6.1. A critical function $\psi(\bar x, A)$ is a function with values
between 0 and 1 (inclusive) such that $\mathcal{E}\psi(\bar x, A) = \varepsilon$, the significance
level, when $\mu = 0$.


A randomized test consists of rejecting the hypothesis with probability
$\psi(x, B)$ when $\bar x = x$ and $A = B$. A nonrandomized test is defined when
$\psi(\bar x, A)$ takes on only the values 0 and 1. Using the form of the
Neyman-Pearson lemma appropriate for critical functions, we obtain the
following corollary:


Corollary 5.6.1. On the basis of observations $x_1, \dots, x_N$ from $N(\mu, \Sigma)$, of
all randomized tests based on $\bar x$ and $A$ that are invariant with respect to
transformations $\bar x^* = C\bar x$, $A^* = CAC'$ ($C$ nonsingular), the $T^2$-test is
uniformly most powerful.


Theorem 5.6.2. On the basis of observations $x_1, \dots, x_N$ from $N(\mu, \Sigma)$, of
all tests of $\mu = 0$ that are invariant with respect to transformations
$x_\alpha^* = Cx_\alpha$ ($C$ nonsingular), the $T^2$-test is a uniformly most powerful test;
that is, the $T^2$-test is at least as powerful as any other invariant test.


Proof. Let $\psi(x_1, \dots, x_N)$ be the critical function of an invariant test. Then

(2)  $\mathcal{E}[\psi(x_1, \dots, x_N)] = \mathcal{E}_{\bar x, A}\{\mathcal{E}[\psi(x_1, \dots, x_N) \mid \bar x, A]\}.$

Since $\bar x$, $A$ are sufficient statistics for $\mu$, $\Sigma$, the expectation
$\mathcal{E}[\psi(x_1, \dots, x_N) \mid \bar x, A]$ depends only on $\bar x$, $A$. It is invariant and
has the same power as $\psi(x_1, \dots, x_N)$. Thus each test in this larger class can
be replaced by one in the smaller class (depending only on $\bar x$ and $A$) that
has identical power. Corollary 5.6.1 completes the proof. ∎


Theorem 5.6.3. Given observations $x_1, \dots, x_N$ from $N(\mu, \Sigma)$, of all tests of
$\mu = 0$ based on $\bar x$ and $A = \sum(x_\alpha - \bar x)(x_\alpha - \bar x)'$ with power depending
only on $N\mu'\Sigma^{-1}\mu$, the $T^2$-test is uniformly most powerful.


Proof. We wish to reduce this theorem to Theorem 5.6.1 by identifying the
class of tests with power depending on $N\mu'\Sigma^{-1}\mu$ with the class of invariant
tests. We need the following definition:


Definition 5.6.2. A test $\psi(x_1, \dots, x_N)$ is said to be almost invariant if

(3)  $\psi(x_1, \dots, x_N) = \psi(Cx_1, \dots, Cx_N)$

for all $x_1, \dots, x_N$ except for a set of $x_1, \dots, x_N$ of Lebesgue measure zero;
this exception set may depend on $C$.




It is clear that Theorems 5.6.1 and 5.6.2 hold if we extend the definition of
invariant test to mean that (3) holds except for a fixed set of $x_1, \dots, x_N$ of
measure 0 (the set not depending on $C$). It has been shown by Hunt and
Stein [Lehmann (1959)] that in our problem almost invariance implies
invariance (in the broad sense).

Now we wish to argue that if $\psi(\bar x, A)$ has power depending only on
$N\mu'\Sigma^{-1}\mu$, it is almost invariant. Since the power of $\psi(\bar x, A)$ depends only
on $N\mu'\Sigma^{-1}\mu$, which is unchanged when $(\mu, \Sigma)$ is replaced by
$(C\mu, C\Sigma C')$, the power is

(4)  $\mathcal{E}_{\mu,\Sigma}\,\psi(\bar x, A) = \mathcal{E}_{C\mu,\,C\Sigma C'}\,\psi(\bar x, A) = \mathcal{E}_{\mu,\Sigma}\,\psi(C\bar x, CAC').$

The second and third terms of (4) are merely different ways of writing the
same integral. Thus

(5)  $\mathcal{E}_{\mu,\Sigma}[\psi(\bar x, A) - \psi(C\bar x, CAC')] \equiv 0,$

identically in $\mu$, $\Sigma$. Since $\bar x$, $A$ are a complete sufficient set of statistics
for $\mu$, $\Sigma$ (Theorem 3.4.2), $f(\bar x, A) = \psi(\bar x, A) - \psi(C\bar x, CAC') = 0$ almost
everywhere. Theorem 5.6.3 follows. ∎


As Theorem 5.6.2 follows from Theorem 5.6.1, so does the following
theorem from Theorem 5.6.3:


Theorem 5.6.4. On the basis of observations $x_1, \dots, x_N$ from $N(\mu, \Sigma)$, of
all tests of $\mu = 0$ with power depending only on $N\mu'\Sigma^{-1}\mu$, the $T^2$-test is a
uniformly most powerful test.


Theorem 5.6.4 was first proved by Simaika (1941). The results and proofs
given in this section follow Lehmann (1959). Hsu (1945) has proved an optimal
property of the $T^2$-test that involves averaging the power over $\mu$ and $\Sigma$.


5.6.2. Admissible Tests 


We now turn to the question of whether the $T^2$-test is a good test compared
to all possible tests; the comparison in the previous section was to the
restricted class of invariant tests. The main result is that the $T^2$-test is
admissible in the class of all tests; that is, there is no other procedure that is
better.


Definition 5.6.3. A test $T^*$ of the null hypothesis $H_0: \omega \in \Omega_0$ against the
alternative $\omega \in \Omega_1$ (disjoint from $\Omega_0$) is admissible if there exists no
other test $T$ such that

(6)  $\Pr\{\text{Reject } H_0 \mid T, \omega\} \le \Pr\{\text{Reject } H_0 \mid T^*, \omega\}, \qquad \omega \in \Omega_0,$

(7)  $\Pr\{\text{Reject } H_0 \mid T, \omega\} \ge \Pr\{\text{Reject } H_0 \mid T^*, \omega\}, \qquad \omega \in \Omega_1,$

with strict inequality for at least one $\omega$.


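The admissibility of the $T^2$-test follows from a theorem of Stein (1956a)
that applies to any exponential family of distributions.

An exponential family of distributions $(\mathcal{Y}, \mathcal{B}, m, \Omega, P)$ consists of a
finite-dimensional Euclidean space $\mathcal{Y}$, a measure $m$ on the $\sigma$-algebra $\mathcal{B}$
of all ordinary Borel sets of $\mathcal{Y}$, a subset $\Omega$ of the adjoint space $\mathcal{Y}'$
(the linear space of all real-valued linear functions on $\mathcal{Y}$) such that

(8)  $\psi(\omega) = \int_{\mathcal{Y}} e^{\omega'y}\, dm(y) < \infty, \qquad \omega \in \Omega,$

and $P$, the function on $\Omega$ to the set of probability measures on $\mathcal{B}$ given by

$P_\omega(A) = \frac{1}{\psi(\omega)}\int_A e^{\omega'y}\, dm(y), \qquad A \in \mathcal{B}.$

The family of normal distributions $N(\mu, \Sigma)$ constitutes an exponential
family, for the density can be written

(9)  $n(x \mid \mu, \Sigma) = \frac{e^{-\frac12\mu'\Sigma^{-1}\mu}}{(2\pi)^{\frac12 p}\,|\Sigma|^{\frac12}}\; e^{\mu'\Sigma^{-1}x - \frac12 \operatorname{tr}\Sigma^{-1}xx'}.$

We map from $\mathcal{X}$ to $\mathcal{Y}$; the vector $y = (y^{(1)\prime}, y^{(2)\prime})'$ is composed of
$y^{(1)} = x$ and $y^{(2)} = (x_1^2, 2x_1x_2, \dots, 2x_1x_p, x_2^2, \dots, x_p^2)'$. The vector
$\omega = (\omega^{(1)\prime}, \omega^{(2)\prime})'$ is composed of $\omega^{(1)} = \Sigma^{-1}\mu$ and
$\omega^{(2)} = -\frac12(\sigma^{11}, \sigma^{12}, \dots, \sigma^{1p}, \sigma^{22}, \dots, \sigma^{pp})'$, where
$(\sigma^{ij}) = \Sigma^{-1}$; the transformation of parameters is one to one. The
measure $m(A)$ of a set $A \in \mathcal{B}$ is the ordinary Lebesgue measure of the set of
$x$ that maps into the set $A$. (Note that the probability measure in $\mathcal{Y}$ is not
defined by a density.)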


Theorem 5.6.5 (Stein). Let $(\mathcal{Y}, \mathcal{B}, m, \Omega, P)$ be an exponential family
and $\Omega_0$ a nonempty proper subset of $\Omega$. (i) Let $A$ be a subset of $\mathcal{Y}$ that
is closed and convex. (ii) Suppose that for every vector $\omega \in \mathcal{Y}'$ and real $c$
for which $\{y \mid \omega'y > c\}$ and $A$ are disjoint, there exists $\omega_1 \in \Omega$ such that
for arbitrarily large $\lambda$ the vector $\omega_1 + \lambda\omega \in \Omega - \Omega_0$. Then the test with
acceptance region $A$ is admissible for testing the hypothesis that
$\omega \in \Omega_0$ against the alternative $\omega \in \Omega - \Omega_0$.


Figure 5.2


The conditions of the theorem are illustrated in Figure 5.2, which is drawn
simultaneously in the space $\mathcal{Y}$ and the set $\Omega$.


Proof. The critical function of the test with acceptance region $A$ is
$\phi_A(y) = 0$, $y \in A$, and $\phi_A(y) = 1$, $y \notin A$. Suppose $\phi(y)$ is the critical
function of a better test, that is,

(10)  $\int \phi(y)\, dP_\omega(y) \le \int \phi_A(y)\, dP_\omega(y), \qquad \omega \in \Omega_0,$

(11)  $\int \phi(y)\, dP_\omega(y) \ge \int \phi_A(y)\, dP_\omega(y), \qquad \omega \in \Omega - \Omega_0,$

with strict inequality for some $\omega$; we shall show that this assumption leads to
a contradiction. Let $B = \{y \mid \phi(y) < 1\}$. (If the competing test is
nonrandomized, $B$ is its acceptance region.) Then

(12)  $\{y \mid \phi_A(y) - \phi(y) > 0\} = \bar A \cap B,$

where $\bar A$ is the complement of $A$. The $m$-measure of the set (12) is positive;
otherwise $\phi_A(y) = \phi(y)$ almost everywhere, and (10) and (11) would hold
with equality for all $\omega$. Since $A$ is convex, there exists an $\omega$ and a $c$ such
that the intersection of $\bar A \cap B$ and $\{y \mid \omega'y > c\}$ has positive $m$-measure.
(Since $A$ is closed, $\bar A$ is open and it can be covered with a denumerable
collection of open spheres, for example, with rational radii and centers with
rational coordinates. Because there is a hyperplane separating $A$ and each
sphere, there exists a denumerable collection of open half-spaces $H_j$ disjoint
from $A$ that covers $\bar A$. Then at least one half-space has an intersection with
$\bar A \cap B$ with positive $m$-measure.) By hypothesis there exists $\omega_1 \in \Omega$ and an
arbitrarily large $\lambda$ such that

(13)  $\omega_\lambda = \omega_1 + \lambda\omega \in \Omega - \Omega_0.$




Then 
(14) f[da(y) - $(y)] aP.(9) 
“reali - &(y)] eH deny) 


~ HOS fib») - oC] ee ar, (9) 


= Hee foal y) - o(a)]eMO7- dP, (9) 


_ #() er 
W(@,) 


(pel #a(a) ~ o(ap]et?- dP. (9) 


+f [oa6a) = a nnlenee'ar, (9). 


For w’y >c we have d,(y) = Land ¢,(y)— Cy) = 0, and (yl dy) — oCy) 
> 0} has positive measure; therefore, the first integral in the braces ap- 
proaches co as A > ow. The second integral is bounded because the integrand 
is bounded by 1, and hence the last expression is positive for sufficiently large 
A. This contradicts (11). | 


This proof was given by Stein (1956a). It is a generalization of a theorem
of Birnbaum (1955).


Corollary 5.6.2. If the conditions of Theorem 5.6.5 hold except that $A$ is
not necessarily closed, but the boundary of $A$ has $m$-measure 0, then the
conclusion of Theorem 5.6.5 holds.


Proof. The closure of $A$ is convex (Problem 5.18), and the test with
acceptance region equal to the closure of $A$ differs from $A$ by a set of
probability 0 for all $\omega \in \Omega$. Furthermore,

(15)  $A \cap \{y \mid \omega'y > c\} = \emptyset \;\Rightarrow\; A \subset \{y \mid \omega'y \le c\} \;\Rightarrow\; \text{closure } A \subset \{y \mid \omega'y \le c\}.$

Then Theorem 5.6.5 holds with $A$ replaced by the closure of $A$. ∎


Theorem 5.6.6. Based on observations $x_1, \ldots, x_N$ from $N(\mu, \Sigma)$,
Hotelling's $T^2$-test is admissible for testing the hypothesis $\mu = 0$.




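Proof. To apply Theorem 5.6.5 we put the distribution of the observations
into the form of an exponential family. By Theorems 3.3.1 and 3.3.2 we can
transform $x_1, \ldots, x_N$ to $z_\alpha = \sum_\beta c_{\alpha\beta} x_\beta$, where $(c_{\alpha\beta})$ is orthogonal and
$z_N = \sqrt{N}\,\bar{x}$. Then the density of $z_1, \ldots, z_N$ (with respect to Lebesgue measure) is

(16)  $(2\pi)^{-\frac{1}{2}pN} |\Sigma|^{-\frac{1}{2}N} \exp\Bigl[ -\tfrac{1}{2} N \mu'\Sigma^{-1}\mu + \sqrt{N}\, \mu'\Sigma^{-1} z_N + \operatorname{tr}\bigl(-\tfrac{1}{2}\Sigma^{-1}\bigr) \sum_{\alpha=1}^{N} z_\alpha z_\alpha' \Bigr].$

The vector $y = (y^{(1)\prime}, y^{(2)\prime})'$ is composed of $y^{(1)} = z_N$ $(= \sqrt{N}\,\bar{x})$ and
$y^{(2)} = (b_{11}, 2b_{12}, \ldots, 2b_{1p}, b_{22}, \ldots, b_{pp})'$, where

(17)  $B = \sum_{\alpha=1}^{N} z_\alpha z_\alpha' = \sum_{\alpha=1}^{N} x_\alpha x_\alpha'.$

The vector $\omega = (\omega^{(1)\prime}, \omega^{(2)\prime})'$ is composed of $\omega^{(1)} = \sqrt{N}\,\Sigma^{-1}\mu$ and
$\omega^{(2)} = -\tfrac{1}{2}(\sigma^{11}, \sigma^{12}, \ldots, \sigma^{1p}, \sigma^{22}, \ldots, \sigma^{pp})'$. The measure $m(A)$ is the Lebesgue
measure of the set of $z_1, \ldots, z_N$ that maps into the set $A$.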


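Lemma 5.6.1. Let $B = A + N\bar{x}\bar{x}'$. Then

(18)  $N\bar{x}'A^{-1}\bar{x} = \dfrac{N\bar{x}'B^{-1}\bar{x}}{1 - N\bar{x}'B^{-1}\bar{x}}.$


Proof of Lemma. If we let $B = A + \sqrt{N}\bar{x}\,\sqrt{N}\bar{x}'$ in (10) of Section 5.2, we
obtain by Corollary A.3.1

(19)  $\dfrac{1}{1 + T^2/(N-1)} = \dfrac{|B - \sqrt{N}\bar{x}\,\sqrt{N}\bar{x}'|}{|B|} = 1 - N\bar{x}'B^{-1}\bar{x}.$ ∎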
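The passage from (19) to (18) is not written out. A short sketch of that step, assuming (as in Section 5.2 with hypothesis $\mu = 0$) that $S = A/(N-1)$, so that $T^2/(N-1) = N\bar{x}'A^{-1}\bar{x}$:

```latex
\begin{align*}
\frac{1}{1 + N\bar{x}'A^{-1}\bar{x}} &= 1 - N\bar{x}'B^{-1}\bar{x}
  && \text{by (19), with } T^2/(N-1) = N\bar{x}'A^{-1}\bar{x}, \\
1 + N\bar{x}'A^{-1}\bar{x} &= \frac{1}{1 - N\bar{x}'B^{-1}\bar{x}}, \\
N\bar{x}'A^{-1}\bar{x} &= \frac{1}{1 - N\bar{x}'B^{-1}\bar{x}} - 1
  = \frac{N\bar{x}'B^{-1}\bar{x}}{1 - N\bar{x}'B^{-1}\bar{x}},
\end{align*}
```

which is (18). Because the right-hand side is an increasing function of $N\bar{x}'B^{-1}\bar{x}$ on $[0, 1)$, a region of the form $T^2 \le$ constant can equivalently be stated as a bound on $N\bar{x}'B^{-1}\bar{x} = z_N'B^{-1}z_N$.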
Thus the acceptance region of a $T^2$-test is

(20)  $A = \{z_N, B \mid z_N'B^{-1}z_N \le k,\ B \text{ positive definite}\}$

for a suitable $k$.

The function $z_N'B^{-1}z_N$ is convex in $(z, B)$ for $B$ positive definite (Problem
5.17). Therefore, the set $z_N'B^{-1}z_N \le k$ is convex. This shows that the set $A$ is
convex. Furthermore, the closure of $A$ is convex (Problem 5.18), and the
probability of the boundary of $A$ is 0.

Now consider the other condition of Theorem 5.6.5. Suppose $A$ is disjoint
with the half-space

(21)  $c < \omega'y = \nu'z_N - \tfrac{1}{2}\operatorname{tr}\Lambda B,$

where $\Lambda$ is a symmetric matrix and $B$ is positive semidefinite. We shall take
$\Lambda_1 = I$. We want to show that $\omega_1 + \lambda\omega \in \Omega - \Omega_0$; that is, that $\nu_1 + \lambda\nu \ne 0$
(which is trivial) and $\Lambda_1 + \lambda\Lambda$ is positive definite for $\lambda > 0$. This is the case
when $\Lambda$ is positive semidefinite. Now we shall show that a half-space (21)
disjoint with $A$ and $\Lambda$ not positive semidefinite implies a contradiction. If $\Lambda$
is not positive semidefinite, it can be written (by Corollary A.4.1 of the
Appendix)

(22)  $\Lambda = D \begin{pmatrix} I & 0 & 0 \\ 0 & -I & 0 \\ 0 & 0 & 0 \end{pmatrix} D',$

where $D$ is nonsingular. If $\Lambda$ is not positive semidefinite, $-I$ is not vacuous,
because its order is the number of negative characteristic roots of $\Lambda$. Let
$z_N = (1/\gamma)z_0$ and

(23)  $B = (D')^{-1} \begin{pmatrix} I & 0 & 0 \\ 0 & \gamma I & 0 \\ 0 & 0 & I \end{pmatrix} D^{-1}.$

Then

(24)  $\omega'y = \dfrac{1}{\gamma}\nu'z_0 + \tfrac{1}{2}\operatorname{tr} \begin{pmatrix} -I & 0 & 0 \\ 0 & \gamma I & 0 \\ 0 & 0 & 0 \end{pmatrix},$

which is greater than $c$ for sufficiently large $\gamma$. On the other hand

(25)  $z_N'B^{-1}z_N = \dfrac{1}{\gamma^2}\, z_0' D \begin{pmatrix} I & 0 & 0 \\ 0 & \gamma^{-1} I & 0 \\ 0 & 0 & I \end{pmatrix} D' z_0,$

which is less than $k$ for sufficiently large $\gamma$. This contradicts the fact that (20)
and (21) are disjoint. Thus the conditions of Theorem 5.6.5 are satisfied and
the theorem is proved. ∎


This proof is due to Stein. 

An alternative proof of admissibility is to show that the $T^2$-test is a proper
Bayes procedure. Suppose an arbitrary random vector $X$ has density $f(x \mid \omega)$
for $\omega \in \Omega$. Consider testing the null hypothesis $H_0: \omega \in \Omega_0$ against the
alternative $H_1: \omega \in \Omega - \Omega_0$. Let $\Pi_0$ be a prior finite measure on $\Omega_0$, and $\Pi_1$
a prior finite measure on $\Omega_1$ $(= \Omega - \Omega_0)$. Then the Bayes procedure (with 0-1 loss
function) is to reject $H_0$ if

(26)  $\dfrac{\int f(x \mid \omega)\, \Pi_1(d\omega)}{\int f(x \mid \omega)\, \Pi_0(d\omega)} \ge c$

for some $c$ $(0 \le c \le \infty)$. If equality in (26) occurs with probability 0 for all
$\omega \in \Omega_0$, then the Bayes procedure is unique and hence admissible. Since the
measures are finite, they can be normed to be probability measures. For the
$T^2$-test of $H_0: \mu = 0$ a pair of measures is suggested in Problem 5.15. (This
pair is not unique.) The reader can verify that with these measures (26)
reduces to the complement of (20).

Among invariant tests it was shown that the $T^2$-test is uniformly most
powerful; that is, it is most powerful against every value of $\mu'\Sigma^{-1}\mu$ among
invariant tests of the specified significance level. We can ask whether the
$T^2$-test is “best” against a specified value of $\mu'\Sigma^{-1}\mu$ among all tests. Here
“best” can be taken to mean admissible minimax; and “minimax” means
maximizing with respect to procedures the minimum with respect to parameter
values of the power. This property was shown in the simplest case of
$p = 2$ and $N = 3$ by Giri, Kiefer, and Stein (1963). The property for general $p$
and $N$ was announced by Salaevskii (1968). He has furnished a proof for the
case of $p = 2$ [Salaevskii (1971)], but has not given a proof for $p > 2$.

Giri and Kiefer (1964) have proved the $T^2$-test is locally minimax (as
$\mu'\Sigma^{-1}\mu \to 0$) and asymptotically (logarithmically) minimax as $\mu'\Sigma^{-1}\mu \to \infty$.


5.7. ELLIPTICALLY CONTOURED DISTRIBUTIONS


5.7.1. Observations Elliptically Contoured


When $x_1, \ldots, x_N$ constitute a sample of $N$ from

(1)  $|\Lambda|^{-\frac{1}{2}}\, g[(x - \nu)'\Lambda^{-1}(x - \nu)],$

the sample mean $\bar{x}$ and covariance $S$ are unbiased estimators of the distribution
mean $\mu = \nu$ and covariance matrix $\Sigma = (\mathcal{E}R^2/p)\Lambda$, where $R^2 =
(X - \nu)'\Lambda^{-1}(X - \nu)$ has finite expectation. The $T^2$-statistic, $T^2 = N(\bar{x} - \mu)'S^{-1}(\bar{x} - \mu)$,
can be used for tests and confidence regions for $\mu$ when $\Sigma$
(or $\Lambda$) is unknown, but the small-sample distribution of $T^2$ in general is
difficult to obtain. However, the limiting distribution of $T^2$ when $N \to \infty$ is
obtained from the facts that $\sqrt{N}(\bar{x} - \mu) \xrightarrow{d} N(0, \Sigma)$ and $S \xrightarrow{p} \Sigma$ (Theorem
3.6.2).




Theorem 5.7.1. Let $x_1, \ldots, x_N$ be a sample from (1). Assume $\mathcal{E}R^2 < \infty$.
Then $T^2 \xrightarrow{d} \chi_p^2$.


Proof. Theorem 3.6.2 implies that $N(\bar{x} - \mu)'\Sigma^{-1}(\bar{x} - \mu) \xrightarrow{d} \chi_p^2$ and
$N(\bar{x} - \mu)'\Sigma^{-1}(\bar{x} - \mu) - T^2 \xrightarrow{p} 0$. ∎


Theorem 5.7.1 implies that the procedures in Section 5.3 can be done on
an asymptotic basis for elliptically contoured distributions. For example, to
test the null hypothesis $\mu = \mu_0$, reject the null hypothesis if

(2)  $N(\bar{x} - \mu_0)'S^{-1}(\bar{x} - \mu_0) \ge \chi_p^2(\alpha),$

where $\chi_p^2(\alpha)$ is the $\alpha$-significance point of the $\chi^2$-distribution with $p$ degrees
of freedom; the limiting probability of (2) when the null hypothesis is true
and $N \to \infty$ is $\alpha$. Similarly the confidence region $N(\bar{x} - m)'S^{-1}(\bar{x} - m) \le
\chi_p^2(\alpha)$ has limiting confidence $1 - \alpha$.
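As an illustration of the asymptotic test (2), here is a minimal Python sketch restricted to the bivariate case $p = 2$. That restriction is an assumption made so that only the standard library is needed: for 2 degrees of freedom the upper $\alpha$ point of $\chi^2$ is $-2\log\alpha$ in closed form. The function names and the toy data are hypothetical and do not come from the text.

```python
import math


def mean_vector(rows):
    """Sample mean of a list of equal-length observation tuples."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def covariance_matrix(rows):
    """Unbiased sample covariance S (divisor N - 1)."""
    n = len(rows)
    m = mean_vector(rows)
    p = len(m)
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]


def t2_statistic_2d(rows, mu0):
    """T^2 = N (xbar - mu0)' S^{-1} (xbar - mu0) for bivariate data."""
    n = len(rows)
    xbar = mean_vector(rows)
    s = covariance_matrix(rows)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    d0, d1 = xbar[0] - mu0[0], xbar[1] - mu0[1]
    # Quadratic form d' S^{-1} d written out with the explicit 2x2 inverse.
    quad = (s[1][1] * d0 * d0 - 2.0 * s[0][1] * d0 * d1 + s[0][0] * d1 * d1) / det
    return n * quad


def chi2_upper_point_2df(alpha):
    """Upper alpha point for 2 degrees of freedom: P(chi2 > c) = exp(-c/2)."""
    return -2.0 * math.log(alpha)


def asymptotic_t2_test_2d(rows, mu0, alpha=0.05):
    """Return (T^2, critical value, reject?) for the large-sample rule (2)."""
    t2 = t2_statistic_2d(rows, mu0)
    crit = chi2_upper_point_2df(alpha)
    return t2, crit, t2 >= crit


# Toy data, purely illustrative; N = 4 is far too small for the limit to apply.
sample = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]
t2, crit, reject = asymptotic_t2_test_2d(sample, (0.0, 0.0))
```

The rule is only justified as $N \to \infty$; the four-point sample is there to make the arithmetic easy to check by hand, not to suggest the approximation is adequate at that size.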


5.7.2. Elliptically Contoured Matrix Distributions

Let $X$ $(N \times p)$ have the density

(3)  $|C|^{-N}\, g[C^{-1}(X - \varepsilon_N\nu')'(X - \varepsilon_N\nu')(C')^{-1}]$

based on the left spherical density $g(Y'Y)$. Here $Y$ has the representation
$Y \overset{d}{=} UR'$, where $U$ $(N \times p)$ has the uniform distribution on $O(N \times p)$, $R$ is
lower triangular, and $U$ and $R$ are independent. Then $X \overset{d}{=} \varepsilon_N\nu' + UR'C'$.
The $T^2$-criterion to test the hypothesis $\nu = 0$ is $N\bar{x}'S^{-1}\bar{x}$, which is invariant
with respect to transformations $X \to XG$. By Corollary 4.5.5 we obtain the
following theorem.


Theorem 5.7.2. Suppose $X$ has the density (3) with $\nu = 0$ and $T^2 =
N\bar{x}'S^{-1}\bar{x}$. Then $[T^2/(N-1)][(N-p)/p]$ has the distribution of $F_{p,N-p} =
(\chi_p^2/p)/[\chi_{N-p}^2/(N-p)]$.


Thus the tests of hypotheses and construction of confidence regions at
stated significance and confidence levels are valid for left spherical distributions.
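The scaling in Theorem 5.7.2 can be sketched as a pair of small helper functions. This is only an illustration of the arithmetic $[T^2/(N-1)][(N-p)/p]$; the names `t2_to_f` and `f_to_t2` are hypothetical, and no $F$ quantiles are computed here.

```python
def t2_to_f(t2, n_obs, p):
    """Convert T^2 to the statistic [T^2/(N-1)][(N-p)/p] of Theorem 5.7.2.

    Under the null hypothesis this quantity is referred to the
    F-distribution with p and N - p degrees of freedom.
    """
    if n_obs <= p:
        raise ValueError("need N > p for N - p denominator degrees of freedom")
    return (t2 / (n_obs - 1)) * ((n_obs - p) / p)


def f_to_t2(f_value, n_obs, p):
    """Inverse map: turn an F critical value into a T^2 critical value."""
    if n_obs <= p:
        raise ValueError("need N > p")
    return f_value * (n_obs - 1) * p / (n_obs - p)


# Illustrative numbers only: T^2 = 18.75 from N = 4 observations with p = 2.
f_stat = t2_to_f(18.75, 4, 2)
```

In practice `f_to_t2` would be applied to a tabulated $F_{p,N-p}(\alpha)$ to obtain the cutoff for $T^2$ directly.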

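The $T^2$-criterion for $H: \nu = 0$ is

(4)  $T^2 = N\bar{x}'S^{-1}\bar{x} \overset{d}{=} N\bar{u}'S_u^{-1}\bar{u},$

since $X \overset{d}{=} UR'C'$,

(5)  $\bar{x}' = \dfrac{1}{N}\varepsilon_N'X \overset{d}{=} \dfrac{1}{N}\varepsilon_N'UR'C' = \bar{u}'(CR)',$

and

(6)  $S = \dfrac{1}{N-1}(X'X - N\bar{x}\bar{x}') \overset{d}{=} \dfrac{1}{N-1}[CRU'UR'C' - N\,CR\bar{u}\bar{u}'(CR)'] = CRS_u(CR)'.$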


5.7.3. Linear Combinations


Lauter, Glimm, and Kropf (1996a, 1996b, 1996c) have observed that a statistician
can use $X'X \overset{d}{=} CRR'C'$ when $\nu = 0$ to determine a $p \times q$ matrix $D$ and
base a $T^2$-test on the transform $Z = XD$. Specifically, define

(7)  $\bar{z}' = \dfrac{1}{N}\varepsilon_N'Z = \bar{x}'D,$

(8)  $S_z = \dfrac{1}{N-1}(Z'Z - N\bar{z}\bar{z}') = D'SD,$

(9)  $T_z^2 = N\bar{z}'S_z^{-1}\bar{z}.$

Since $Q_NZ \overset{d}{=} Q_NUR'C'D \overset{d}{=} UR'C'D = Z$, the matrix $Z$ is based on the left-spherical
$YD$ and hence has the representation $Z = VR^{*\prime}$, where $V$ $(N \times q)$
has the uniform distribution on $O(N \times q)$, independent of $R^{*\prime}$ (upper
triangular) having the distribution derived from $R^*R^{*\prime} = Z'Z$. The distribution
of $T_z^2/(N-1)$ is $F_{q,N-q}\, q/(N-q)$.

The matrix $D$ can also involve prior information as well as knowledge of
$X'X$. If $p$ is large, $q$ can be small; the test based on $T_z^2$ may be
more powerful than a test based on $T^2$.

Lauter, Glimm, and Kropf give several examples of choosing $D$. One of
them is to choose $D$ $(p \times 1)$ as $[\operatorname{Diag}(X'X)]^{-\frac{1}{2}}\varepsilon_p$, where $\operatorname{Diag} A$ is a diagonal
matrix with $i$th diagonal element $a_{ii}$. The statistic $T_z^2$ is called the standardized
sum statistic.


PROBLEMS 


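5.1. (Sec. 5.2) Let $x_\alpha$ be distributed according to $N(\mu + \beta(z_\alpha - \bar{z}), \Sigma)$, $\alpha =
1, \ldots, N$, where $\bar{z} = (1/N)\sum z_\alpha$. Let $b = [1/\sum(z_\alpha - \bar{z})^2]\sum x_\alpha(z_\alpha - \bar{z})$, $(N-2)S =
\sum[x_\alpha - \bar{x} - b(z_\alpha - \bar{z})][x_\alpha - \bar{x} - b(z_\alpha - \bar{z})]'$, and $T^2 = \sum(z_\alpha - \bar{z})^2\, b'S^{-1}b$. Show
that $T^2$ has the $T^2$-distribution with $N - 2$ degrees of freedom. [Hint: See
Problem 3.13.]


5.2. (Sec. 5.2.2) Show that $T^2/(N-1)$ can be written as $R^2/(1 - R^2)$ with the
correspondences given in Table 5.1.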


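Table 5.1

Section 5.2                                Section 4.4
$z_{0\alpha} = 1/\sqrt{N}$                 $z_{1\alpha}$
$x_\alpha$                                 $z_\alpha^{(2)}$
$\sqrt{N}\,\bar{x}$                        $a_{(1)} = \sum z_{1\alpha} z_\alpha^{(2)}$
$B = \sum x_\alpha x_\alpha'$              $A_{22} = \sum z_\alpha^{(2)} z_\alpha^{(2)\prime}$
$1 = \sum z_{0\alpha}^2$                   $a_{11} = \sum z_{1\alpha}^2$
$T^2/(N-1)$                                $R^2/(1 - R^2)$
$p$                                        $p - 1$
$N$                                        $n$


5.3. (Sec. 5.2.2) Let

$\dfrac{R^2}{1 - R^2} = \dfrac{\sum u_\alpha x_\alpha' \bigl(\sum x_\alpha x_\alpha'\bigr)^{-1} \sum u_\alpha x_\alpha}{\sum u_\alpha^2 - \sum u_\alpha x_\alpha' \bigl(\sum x_\alpha x_\alpha'\bigr)^{-1} \sum u_\alpha x_\alpha},$

where $u_1, \ldots, u_N$ are $N$ numbers and $x_1, \ldots, x_N$ are independent, each with the
distribution $N(0, \Sigma)$. Prove that the distribution of $R^2/(1 - R^2)$ is independent
of $u_1, \ldots, u_N$. [Hint: There is an orthogonal $N \times N$ matrix $C$ that carries
$(u_1, \ldots, u_N)$ into a vector proportional to $(1/\sqrt{N}, \ldots, 1/\sqrt{N})$.]


5.4. (Sec. 5.2.2) Use Problems 5.2 and 5.3 to show that $[T^2/(N-1)][(N-p)/p]$
has the $F_{p,N-p}$-distribution (under the null hypothesis). [Note: This is the
analysis that corresponds to Hotelling's geometric proof (1931).]


5.5. (Sec. 5.2.2) Let $T^2 = N\bar{x}'S^{-1}\bar{x}$, where $\bar{x}$ and $S$ are the mean vector and
covariance matrix of a sample of $N$ from $N(\mu, \Sigma)$. Show that $T^2$ is distributed
the same when $\mu$ is replaced by $\lambda = (\tau, 0, \ldots, 0)'$, where $\tau^2 = \mu'\Sigma^{-1}\mu$, and $\Sigma$ is
replaced by $I$.


5.6. (Sec. 5.2.2) Let $u = [T^2/(N-1)]/[1 + T^2/(N-1)]$. Show that $u =
\gamma V'(VV')^{-1}V\gamma'$, where $\gamma = (1/\sqrt{N}, \ldots, 1/\sqrt{N})$ and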



5.7. (Sec. §22) Let 


f 
v 1 
i 
v* =v, -— ——v, =v, | 1 - — vip, |, 
yyy yyy 


N.S NS vw 
vy; 

yea! : 
* 
Vp 


2 1y\2 
ne aD (vv 
vivT’ viv, 
#\ fk ke ary 7! 
Vo V4U5 Vav 
1 x . 
weer | | 
p* p* pe! p® y*! 
I pre e Vy 
Hint: EV -= V*, where 
1 0 0 
v2V, 
- 7 | 0 
viv, 
E= F 
ee 
viv; 
a 0 1 
vv; 




il, 


5.8. (Sec. 5.2.2) Prove that $w$ has the distribution of the square of a multiple
correlation between one vector and $p - 1$ vectors in $(N-1)$-space without
subtracting means; that is, it has density

$\dfrac{\Gamma[\tfrac{1}{2}(N-1)]}{\Gamma[\tfrac{1}{2}(p-1)]\,\Gamma[\tfrac{1}{2}(N-p)]}\; w^{\frac{1}{2}(p-1)-1}(1 - w)^{\frac{1}{2}(N-p)-1}.$

[Hint: The transformation of Problem 5.7 is a projection of $v_2, \ldots, v_p, \gamma$ on the
$(N-1)$-space orthogonal to $v_1$.]


5.9. (Sec. 5.2.2) Verify that $r = s/(1 - s)$ multiplied by $(N-1)/1$ has the noncentral
$F$-distribution with 1 and $N - 1$ degrees of freedom and noncentrality
parameter $N\tau^2$.


5.10. (Sec. 5.2.2) From Problems 5.5-5.9, verify Corollary 5.2.1.


5.11. (Sec. 5.3) Use the data in Section 3.2 to test the hypothesis that neither drug
has a soporific effect at significance level 0.01.


5.12. (Sec. 5.3) Using the data in Section 3.2, give a confidence region for $\mu$ with
confidence coefficient 0.95.


5.13. (Sec. 5.3) Prove the statement in Section 5.3.6 that the $T^2$-statistic is independent
of the choice of $C$.


5.14. (Sec. 5.5) Use the data of Problem 4.41 to test the hypothesis that the mean
head length and breadth of first sons are equal to those of second sons at
significance level 0.01.


5.15. (Sec. 5.6.2) $T^2$-test as a Bayes procedure [Kiefer and Schwartz (1965)]. Let
$x_1, \ldots, x_N$ be independently distributed, each according to $N(\mu, \Sigma)$. Let $\Pi_0$ be
defined by $[\mu, \Sigma] = [0, (I + \eta\eta')^{-1}]$ with $\eta$ having a density proportional to
$|I + \eta\eta'|^{-\frac{1}{2}N}$, and let $\Pi_1$ be defined by $[\mu, \Sigma] = [(I + \eta\eta')^{-1}\eta, (I + \eta\eta')^{-1}]$
with $\eta$ having a density proportional to

$|I + \eta\eta'|^{-\frac{1}{2}N} \exp[\tfrac{1}{2}N\eta'(I + \eta\eta')^{-1}\eta].$

(a) Show that the measures are finite for $N > p$ by showing $\eta'(I + \eta\eta')^{-1}\eta \le 1$
and verifying that the integral of $|I + \eta\eta'|^{-\frac{1}{2}N} = (1 + \eta'\eta)^{-\frac{1}{2}N}$ is finite.

(b) Show that the inequality (26) is equivalent to $N\bar{x}'\bigl(\sum_{\alpha=1}^{N} x_\alpha x_\alpha'\bigr)^{-1}\bar{x} \ge k$.
Hence the $T^2$-test is Bayes and thus admissible.


5.16. (Sec. 5.6.2) Let $g(t) = f[ty_1 + (1 - t)y_2]$, where $f(y)$ is a real-valued function
of the vector $y$. Prove that if $g(t)$ is convex, then $f(y)$ is convex.


5.17. (Sec. 5.6.2) Show that $z'B^{-1}z$ is a convex function of $(z, B)$, where $B$ is a
positive definite matrix. [Hint: Use Problem 5.16.]


5.18. (Sec. 5.6.2) Prove that if the set $A$ is convex, then the closure of $A$ is convex.


5.19. (Sec. 5.3) Let $\bar{x}$ and $S$ be based on $N$ observations from $N(\mu, \Sigma)$, and let $x$
be an additional observation from $N(\mu, \Sigma)$. Show that $x - \bar{x}$ is distributed
according to

$N[0, (1 + 1/N)\Sigma].$

Verify that $[N/(N+1)](x - \bar{x})'S^{-1}(x - \bar{x})$ has the $T^2$-distribution with $N - 1$
degrees of freedom. Show how this statistic can be used to give a prediction
region for $x$ based on $\bar{x}$ and $S$ (i.e., a region such that one has a given
confidence that the next observation will fall into it).


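5.20. (Sec. 5.3) Let $x_\alpha^{(i)}$ be observations from $N(\mu^{(i)}, \Sigma_i)$, $\alpha = 1, \ldots, N_i$, $i = 1, 2$. Find
the likelihood ratio criterion for testing the hypothesis $\mu^{(1)} = \mu^{(2)}$.


5.21. (Sec. 5.4) Prove that $\mu'\Sigma^{-1}\mu$ is larger for $\mu' = (\mu_1, \mu_2)$ than for $\mu = \mu_1$ by
verifying

$\mu'\Sigma^{-1}\mu = \dfrac{1}{1 - \rho^2}\Bigl(\dfrac{\mu_1^2}{\sigma_1^2} - \dfrac{2\rho\mu_1\mu_2}{\sigma_1\sigma_2} + \dfrac{\mu_2^2}{\sigma_2^2}\Bigr) = \dfrac{\mu_1^2}{\sigma_1^2} + \dfrac{(\mu_2 - \rho\sigma_2\mu_1/\sigma_1)^2}{(1 - \rho^2)\sigma_2^2}.$

Discuss the power of the test $\mu_1 = 0$ compared to the power of the test $\mu_1 = 0$,
$\mu_2 = 0$.


5.22. (Sec. 5.3)

(a) Using the data of Section 5.3.4, test the hypothesis $\mu_1^{(1)} = \mu_1^{(2)}$.
(b) Test the hypothesis $\mu_1^{(1)} = \mu_1^{(2)}$, $\mu_2^{(1)} = \mu_2^{(2)}$.


5.23. (Sec. 5.4) Let

$\mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$

Prove $\mu'\Sigma^{-1}\mu \ge \mu^{(1)\prime}\Sigma_{11}^{-1}\mu^{(1)}$. Give a condition for strict inequality to hold.
[Hint: This is the vector analog of Problem 5.21.]


5.24. Let $X^{(i)\prime} = (Y^{(i)\prime}, Z^{(i)\prime})$, $i = 1, 2$, where $Y^{(i)}$ has $p$ components and $Z^{(i)}$ has $q$
components, be distributed according to $N(\mu^{(i)}, \Sigma)$, where

$\mu^{(i)} = \begin{pmatrix} \mu_y^{(i)} \\ \mu_z^{(i)} \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{yy} & \Sigma_{yz} \\ \Sigma_{zy} & \Sigma_{zz} \end{pmatrix}, \qquad i = 1, 2.$

Find the likelihood ratio criterion (or equivalent $T^2$-criterion) for testing $\mu_z^{(1)} =
\mu_z^{(2)}$ given $\mu_y^{(1)} = \mu_y^{(2)}$ on the basis of a sample of $N_i$ on $X^{(i)}$, $i = 1, 2$. [Hint:
Express the likelihood in terms of the marginal density of $Y^{(i)}$ and the
conditional density of $Z^{(i)}$ given $Y^{(i)}$.]


5.25. Find the distribution of the criterion in the preceding problem under the null
hypothesis.


5.26. (Sec. 5.5) Suppose $x_\alpha^{(g)}$ is an observation from $N(\mu^{(g)}, \Sigma_g)$, $\alpha = 1, \ldots, N_g$,
$g = 1, \ldots, q$.

(a) Show that the hypothesis $\mu^{(1)} = \cdots = \mu^{(q)}$ is equivalent to $\mathcal{E}y_\alpha^{(i)} = 0$,
$i = 1, \ldots, q - 1$, where

$y_\alpha^{(i)} = a_1^{(i)} x_\alpha^{(1)} + \sum_{g=2}^{q} a_g^{(i)} \sqrt{\dfrac{N_1}{N_g}} \Bigl( x_\alpha^{(g)} - \dfrac{1}{N_1} \sum_{\beta=1}^{N_1} x_\beta^{(g)} + \dfrac{1}{\sqrt{N_1 N_g}} \sum_{\beta=1}^{N_g} x_\beta^{(g)} \Bigr),$

$N_1 \le N_g$, $g = 2, \ldots, q$; and $(a_1^{(i)}, \ldots, a_q^{(i)})$, $i = 1, \ldots, q - 1$, are linearly independent.

(b) Show how to construct a $T^2$-test of the hypothesis using $(y^{(1)\prime}, \ldots, y^{(q-1)\prime})'$
yielding an $F$-statistic with $(q - 1)p$ and $N_1 - (q - 1)p$ degrees of freedom
[Anderson (1963b)].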


5.27. (Sec. 5.2) Prove (25) is the density of $V = \chi_a^2/(\chi_a^2 + \chi_b^2)$. [Hint: In the joint
density of $U = \chi_a^2$ and $W = \chi_b^2$ make the transformation $u = vw(1 - v)^{-1}$, $w = w$
and integrate out $w$.]


CHAPTER 6 


Classification of Observations 


6.1. THE PROBLEM OF CLASSIFICATION 


The problem of classification arises when an investigator makes a number of 
measurements on an individual and wishes to classify the individual into one 
of several categories on the basis of these measurements. The investigator 
cannot identify the individual with a category directly but must use these 
measurements. In many cases it can be assumed that there are a finite num- 
ber of categories or populations from which the individual may have come and 
each population is characterized by a probability distribution of the measure- 
ments. Thus an individual is considered as a random observation from this 
population. The question is: Given an individual with certain measurements, 
from which population did the person arise? 

The problem of classification may be considered as a problem of “statisti- 
cal decision functions.” We have a number of hypotheses: Each hypothesis is 
that the distribution of the observation is a given one. We must accept one of 
these hypotheses and reject the others. If only two populations are admitted,
we have an elementary problem of testing one hypothesis of a specified 
distribution against another. 

In some instances, the categories are specified beforehand in the sense 
that the probability distributions of the measurements are assumed com- 
pletely known. In other cases, the form of each distribution may be known, 
but the parameters of the distribution must be estimated from a sample from 
that population. 

Let us give an example of a problem of classification. Prospective students 
applying for admission into college are given a battery of tests; the vector of 


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson 
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc. 




scores is a set of measurements x. The prospective student may be a member 
of one population consisting of those students who will successfully complete 
college training or, rather, have potentialities for successfully completing 
training, or the student may be a member of the other population, those who 
will not complete the college course successfully. The problem is to classify a 
student applying for admission on the basis of his scores on the entrance 
examination. 

In this chapter we shall develop the theory of classification in general 
terms and then apply it to cases involving the normal distribution. In Section 
6.2 the problem of classification with two populations is defined in terms of 
decision theory, and in Section 6.3 Bayes and admissible solutions are 
obtained. In Section 6.4 the theory is applied to two known normal popula- 
tions, differing with respect to means, yielding the population linear dis- 
criminant function. When the parameters are unknown, they are replaced by 
estimates (Section 6.5). An alternative procedure is maximum likelihood. In 
Section 6.6 the probabilities of misclassification by the two methods are evaluated
in terms of asymptotic expansions of the distributions. Then these developments
are carried out for several populations. Finally, in Section 6.10 linear
procedures for the two populations are studied when the covariance matrices 
are different and the parameters are known. 


6.2. STANDARDS OF GOOD CLASSIFICATION 


6.2.1. Preliminary Considerations 


In constructing a procedure of classification, it is desired to minimize the 
probability of misclassification, or, more specifically, it is desired to minimize 
on the average the bad effects of misclassification. Now let us make this 
notion precise. For convenience we shall now consider the case of only two 
categories. Later we shall treat the more general case. This section develops 
the ideas of Section 3.4 in more detail for the problem of two decisions. 

Suppose an individual is an observation from either population $\pi_1$ or
population $\pi_2$. The classification of an observation depends on the vector of
measurements $x' = (x_1, \ldots, x_p)$ on that individual. We set up a rule that if an
individual is characterized by certain sets of values of $x_1, \ldots, x_p$ that person
will be classified as from $\pi_1$, if other values, as from $\pi_2$.

We can think of an observation as a point in a $p$-dimensional space. We
divide this space into two regions. If the observation falls in $R_1$, we classify it
as coming from population $\pi_1$, and if it falls in $R_2$ we classify it as coming
from population $\pi_2$.

In following a given classification procedure, the statistician can make two
kinds of errors in classification. If the individual is actually from $\pi_1$, the
statistician can classify him or her as coming from population $\pi_2$; if from $\pi_2$,
the statistician can classify him or her as from $\pi_1$. We need to know the
relative undesirability of these two kinds of misclassification. Let the cost of
the first type of misclassification be $C(2|1)$ $(> 0)$, and let the cost of misclassifying
an individual from $\pi_2$ as from $\pi_1$ be $C(1|2)$ $(> 0)$. These costs
may be measured in any kind of units. As we shall see later, it is only the
ratio of the two costs that is important. The statistician may not know these
costs in each case, but will often have at least a rough idea of them.

Table 6.1 indicates the costs of correct and incorrect classification. Clearly,
a good classification procedure is one that minimizes in some sense or other
the cost of misclassification.


Table 6.1

                        Statistician's Decision
                        $\pi_1$         $\pi_2$
Population   $\pi_1$    0               $C(2|1)$
             $\pi_2$    $C(1|2)$        0


6.2.2. Two Cases of Two Populations 


We shall consider ways of defining “minimum cost” in two cases. In one case
we shall suppose that we have a priori probabilities of the two populations.
Let the probability that an observation comes from population $\pi_1$ be $q_1$ and
from population $\pi_2$ be $q_2$ $(q_1 + q_2 = 1)$. The probability properties of population
$\pi_1$ are specified by a distribution function. For convenience we shall
treat only the case where the distribution has a density, although the case of
discrete probabilities lends itself to almost the same treatment. Let the
density of population $\pi_1$ be $p_1(x)$ and that of $\pi_2$ be $p_2(x)$. If we have a
region $R_1$ of classification as from $\pi_1$, the probability of correctly classifying
an observation that actually is drawn from population $\pi_1$ is

(1)  $P(1|1, R) = \int_{R_1} p_1(x)\, dx,$

where $dx = dx_1 \cdots dx_p$, and the probability of misclassification of an observation
from $\pi_1$ is

(2)  $P(2|1, R) = \int_{R_2} p_1(x)\, dx.$

Similarly, the probability of correctly classifying an observation from $\pi_2$ is

(3)  $P(2|2, R) = \int_{R_2} p_2(x)\, dx,$

and the probability of misclassifying such an observation is

(4)  $P(1|2, R) = \int_{R_1} p_2(x)\, dx.$

Since the probability of drawing an observation from $\pi_1$ is $q_1$, the
probability of drawing an observation from $\pi_1$ and correctly classifying it is
$q_1P(1|1, R)$; that is, this is the probability of the situation in the upper
left-hand corner of Table 6.1. Similarly, the probability of drawing an
observation from $\pi_1$ and misclassifying it is $q_1P(2|1, R)$. The probability
associated with the lower left-hand corner of Table 6.1 is $q_2P(1|2, R)$, and
with the lower right-hand corner is $q_2P(2|2, R)$.

What is the average or expected loss from costs of misclassification? It is
the sum of the products of costs of misclassifications with their respective
probabilities of occurrence:

(5)  $C(2|1)P(2|1, R)q_1 + C(1|2)P(1|2, R)q_2.$

It is this average loss that we wish to minimize. That is, we want to divide our
space into regions $R_1$ and $R_2$ such that the expected loss is as small as
possible. A procedure that minimizes (5) for given $q_1$ and $q_2$ is called a Bayes
procedure.
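To make the expected loss (5) concrete, the following sketch evaluates it for a one-dimensional example. Everything specific here is an assumption chosen for illustration: the two populations are taken to be univariate normal with a common standard deviation, the regions are $R_1 = \{x \le t\}$ and $R_2 = \{x > t\}$, and the function names are hypothetical.

```python
import math


def normal_cdf(x, mean=0.0, sd=1.0):
    """Normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def misclassification_probs(t, mean1, mean2, sd=1.0):
    """Return (P(2|1, R), P(1|2, R)) for R_1 = {x <= t}, R_2 = {x > t}.

    Assumes mean1 < mean2, so that small x points toward population 1.
    """
    p21 = 1.0 - normal_cdf(t, mean1, sd)  # mass of pi_1 falling in R_2
    p12 = normal_cdf(t, mean2, sd)        # mass of pi_2 falling in R_1
    return p21, p12


def expected_loss(t, q1, q2, c21, c12, mean1, mean2, sd=1.0):
    """Expected loss (5): C(2|1) P(2|1,R) q1 + C(1|2) P(1|2,R) q2."""
    p21, p12 = misclassification_probs(t, mean1, mean2, sd)
    return c21 * p21 * q1 + c12 * p12 * q2


# Illustrative setting: N(0,1) versus N(2,1), equal priors and unit costs.
loss_mid = expected_loss(1.0, 0.5, 0.5, 1.0, 1.0, 0.0, 2.0)
loss_low = expected_loss(0.5, 0.5, 0.5, 1.0, 1.0, 0.0, 2.0)
loss_high = expected_loss(1.5, 0.5, 0.5, 1.0, 1.0, 0.0, 2.0)
```

In this symmetric setting the cutoff at the midpoint of the two means gives a smaller expected loss than either shifted cutoff, which is what one would expect of the Bayes procedure for equal priors and costs.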

In the example of admission of students, the undesirability of misclassification
is, in one instance, the expense of teaching a student who will not
complete the course successfully and is, in the other instance, the undesirability
of excluding from college a potentially good student.

The other case we shall treat is that in which there are no known a priori
probabilities. In this case the expected loss if the observation is from $\pi_1$ is

(6)  $C(2|1)P(2|1, R) = r(1, R);$

the expected loss if the observation is from $\pi_2$ is

(7)  $C(1|2)P(1|2, R) = r(2, R).$

We do not know whether the observation is from $\pi_1$ or from $\pi_2$, and we do
not know probabilities of these two instances.

A procedure $R$ is at least as good as a procedure $R^*$ if $r(1, R) \le r(1, R^*)$
and $r(2, R) \le r(2, R^*)$; $R$ is better than $R^*$ if at least one of these inequalities
is a strict inequality. Usually there is no one procedure that is better than all
other procedures or is at least as good as all other procedures. A procedure
$R$ is called admissible if there is no procedure better than $R$; we shall be
interested in the entire class of admissible procedures. It will be shown that
under certain conditions this class is the same as the class of Bayes procedures.
A class of procedures is complete if for every procedure outside the
class there is one in the class which is better; a class is called essentially
complete if for every procedure outside the class there is one in the class
which is at least as good. A minimal complete class (if it exists) is a complete
class such that no proper subset is a complete class; a similar definition holds
for a minimal essentially complete class. Under certain conditions we shall
show that the admissible class is minimal complete. To simplify the discussion
we shall consider procedures the same if they only differ on sets of probability
zero. In fact, throughout the next section we shall make statements which
are meant to hold except for sets of probability zero without saying so explicitly.
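The ordering just defined can be sketched on a finite list of procedures, each summarized by its risk pair $(r(1, R), r(2, R))$. This is only an illustration: the function names and the four risk pairs are hypothetical, and "admissible" below means undominated within the listed collection, which is weaker than admissibility among all procedures.

```python
def at_least_as_good(r, r_star):
    """R is at least as good as R* if both expected losses are no larger."""
    return r[0] <= r_star[0] and r[1] <= r_star[1]


def better(r, r_star):
    """R is better than R* if it is at least as good and strictly so somewhere."""
    return at_least_as_good(r, r_star) and (r[0] < r_star[0] or r[1] < r_star[1])


def admissible_within(risks):
    """Names of procedures that no other listed procedure is better than.

    `risks` maps a procedure name to its pair (r(1, R), r(2, R)).
    """
    return [
        name
        for name, r in risks.items()
        if not any(better(other, r)
                   for other_name, other in risks.items()
                   if other_name != name)
    ]


# Hypothetical risk pairs for four procedures.
example_risks = {
    "A": (0.1, 0.4),
    "B": (0.2, 0.2),
    "C": (0.3, 0.4),  # A is better than C, so C is dropped
    "D": (0.4, 0.1),
}
undominated = admissible_within(example_risks)
```

Procedures A, B, and D are not comparable with one another under this ordering, which illustrates why there is usually no single best procedure.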

A principle that usually leads to a unique procedure is the minimax
principle. A procedure is minimax if the maximum expected loss, $r(i, R)$, is a
minimum. From a conservative point of view, this may be considered an
optimum procedure. For a general discussion of the concepts in this section
and the next see Wald (1950), Blackwell and Girshick (1954), Ferguson
(1967), DeGroot (1970), and Berger (1980b).


6.3. PROCEDURES OF CLASSIFICATION INTO ONE OF TWO 
POPULATIONS WITH KNOWN PROBABILITY DISTRIBUTIONS 


6.3.1. The Case When A Priori Probabilities Are Known


We now turn to the problem of choosing regions $R_1$ and $R_2$ so as to minimize
(5) of Section 6.2. Since we have a priori probabilities, we can define joint
probabilities of the population and the observed set of variables. The probability
that an observation comes from $\pi_1$ and that each variate is less than
the corresponding component in $y$ is

(1)  $\int_{-\infty}^{y_1} \cdots \int_{-\infty}^{y_p} q_1 p_1(x)\, dx_1 \cdots dx_p.$

We can also define the conditional probability that an observation came from
a certain population given the values of the observed variates. For instance,
the conditional probability of coming from population $\pi_1$, given an observation
$x$, is

(2)  $\dfrac{q_1 p_1(x)}{q_1 p_1(x) + q_2 p_2(x)}.$

Suppose for a moment that $C(1|2) = C(2|1) = 1$. Then the expected loss is

(3)  $q_1 \int_{R_2} p_1(x)\, dx + q_2 \int_{R_1} p_2(x)\, dx.$



This is also the probability of a misclassification; hence we wish to minimize
the probability of misclassification.

For a given observed point $x$ we minimize the probability of a misclassification
by assigning the population that has the higher conditional probability.
If

(4)  $\dfrac{q_1 p_1(x)}{q_1 p_1(x) + q_2 p_2(x)} \ge \dfrac{q_2 p_2(x)}{q_1 p_1(x) + q_2 p_2(x)},$

we choose population $\pi_1$. Otherwise we choose population $\pi_2$. Since we
minimize the probability of misclassification at each point, we minimize it
over the whole space. Thus the rule is

(5)  $R_1$: $q_1 p_1(x) \ge q_2 p_2(x)$,
     $R_2$: $q_1 p_1(x) < q_2 p_2(x)$.

If $q_1 p_1(x) = q_2 p_2(x)$, the point could be classified as either from $\pi_1$ or $\pi_2$;
we have arbitrarily put it into $R_1$. If $q_1 p_1(x) + q_2 p_2(x) = 0$ for a given $x$, that
point also may go into either region.

Now let us prove formally that (5) is the best procedure. For any procedure
$R^* = (R_1^*, R_2^*)$, the probability of misclassification is

(6)  $q_1 \int_{R_2^*} p_1(x)\, dx + q_2 \int_{R_1^*} p_2(x)\, dx = \int_{R_2^*} [q_1 p_1(x) - q_2 p_2(x)]\, dx + q_2 \int p_2(x)\, dx.$

On the right-hand side the second term is a given number; the first term is
minimized if $R_2^*$ includes the points $x$ such that $q_1 p_1(x) - q_2 p_2(x) < 0$ and
excludes the points for which $q_1 p_1(x) - q_2 p_2(x) > 0$. If

(7)  $\Pr\Bigl\{ \dfrac{p_1(x)}{p_2(x)} = \dfrac{q_2}{q_1} \Bigm| \pi_i \Bigr\} = 0, \qquad i = 1, 2,$

then the Bayes procedure is unique except for sets of probability zero.

Now we notice that mathematically the problem was: given nonnegative
constants $q_1$ and $q_2$ and nonnegative functions $p_1(x)$ and $p_2(x)$, choose
regions $R_1$ and $R_2$ so as to minimize (3). The solution is (5). If we wish to
minimize (5) of Section 6.2, which can be written

(8)  $[C(2|1)q_1] \int_{R_2} p_1(x)\, dx + [C(1|2)q_2] \int_{R_1} p_2(x)\, dx,$

we choose $R_1$ and $R_2$ according to

(9)  $R_1$: $[C(2|1)q_1]\, p_1(x) \ge [C(1|2)q_2]\, p_2(x)$,
     $R_2$: $[C(2|1)q_1]\, p_1(x) < [C(1|2)q_2]\, p_2(x)$,

since $C(2|1)q_1$ and $C(1|2)q_2$ are nonnegative constants. Another way of
writing (9) is

(10)  $R_1$: $\dfrac{p_1(x)}{p_2(x)} \ge \dfrac{C(1|2)q_2}{C(2|1)q_1}$,

      $R_2$: $\dfrac{p_1(x)}{p_2(x)} < \dfrac{C(1|2)q_2}{C(2|1)q_1}$.


Theorem 6.3.1. If $q_1$ and $q_2$ are a priori probabilities of drawing an
observation from population $\pi_1$ with density $p_1(x)$ and $\pi_2$ with density $p_2(x)$,
respectively, and if the cost of misclassifying an observation from $\pi_1$ as from $\pi_2$
is $C(2|1)$ and an observation from $\pi_2$ as from $\pi_1$ is $C(1|2)$, then the regions of
classification $R_1$ and $R_2$, defined by (10), minimize the expected cost. If

(11)  $\Pr\Bigl\{ \dfrac{p_1(x)}{p_2(x)} = \dfrac{q_2 C(1|2)}{q_1 C(2|1)} \Bigm| \pi_i \Bigr\} = 0, \qquad i = 1, 2,$

then the procedure is unique except for sets of probability zero.
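The rule of Theorem 6.3.1 can be sketched in code using the product form (9), which avoids dividing by $p_2(x)$ when it is zero. The function names, the choice of univariate normal densities, and the numerical values are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Univariate normal density."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def bayes_region(x, p1, p2, q1, q2, c21=1.0, c12=1.0):
    """Return 1 if x falls in R_1 of rule (9), otherwise 2.

    p1, p2 : density functions of populations pi_1 and pi_2
    q1, q2 : a priori probabilities
    c21    : cost C(2|1) of classifying a pi_1 observation as pi_2
    c12    : cost C(1|2) of classifying a pi_2 observation as pi_1
    Ties are assigned to R_1, as in the text.
    """
    return 1 if c21 * q1 * p1(x) >= c12 * q2 * p2(x) else 2


# Illustrative populations: N(0, 1) and N(2, 1).
def density_1(x):
    return normal_pdf(x, 0.0, 1.0)


def density_2(x):
    return normal_pdf(x, 2.0, 1.0)


# Equal priors and equal costs: the boundary is the midpoint x = 1.
equal_cost_low = bayes_region(0.9, density_1, density_2, 0.5, 0.5)
equal_cost_high = bayes_region(1.1, density_1, density_2, 0.5, 0.5)

# Tripling C(1|2) shrinks R_1: here p1/p2 = exp(2 - 2x), so the boundary
# moves to x = 1 - log(3)/2, roughly 0.45.
costly_low = bayes_region(0.4, density_1, density_2, 0.5, 0.5, c21=1.0, c12=3.0)
costly_mid = bayes_region(0.9, density_1, density_2, 0.5, 0.5, c21=1.0, c12=3.0)
```

Only the ratio $C(1|2)q_2 / [C(2|1)q_1]$ enters the rule, so rescaling both costs by the same factor leaves the regions unchanged.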


6.3.2. The Case When No Set of A Priori Probabilities Is Known 


In many instances of classification the statistician cannot assign a priori 
probabilities to the two populations. In this case we shall look for the class of 
admissible procedures, that is, the set of procedures that cannot be improved 
upon. 

First, let us prove that a Bayes procedure is admissible. Let $R = (R_1, R_2)$
be a Bayes procedure for a given $q_1$, $q_2$; is there a procedure $R^* = (R_1^*, R_2^*)$
such that $P(1|2, R^*) \le P(1|2, R)$ and $P(2|1, R^*) \le P(2|1, R)$ with at least
one strict inequality? Since $R$ is a Bayes procedure,

(12)  $q_1P(2|1, R) + q_2P(1|2, R) \le q_1P(2|1, R^*) + q_2P(1|2, R^*).$

This inequality can be written

(13)  $q_1[P(2|1, R) - P(2|1, R^*)] \le q_2[P(1|2, R^*) - P(1|2, R)].$

Suppose $0 < q_1 < 1$. Then if $P(1|2, R^*) < P(1|2, R)$, the right-hand side of
(13) is less than zero and therefore $P(2|1, R) < P(2|1, R^*)$. Then $P(2|1, R^*)
< P(2|1, R)$ similarly implies $P(1|2, R) < P(1|2, R^*)$. Thus $R^*$ is not better
than $R$, and $R$ is admissible. If $q_1 = 0$, then (13) implies $0 \le P(1|2, R^*) -
P(1|2, R)$. For a Bayes procedure, $R_1$ includes only points for which $p_2(x) = 0$.
Therefore, $P(1|2, R) = 0$ and if $R^*$ is to be better $P(1|2, R^*) = 0$. If $\Pr\{p_2(x)
= 0 \mid \pi_1\} = 0$, then $P(2|1, R) = \Pr\{p_2(x) > 0 \mid \pi_1\} = 1$. If $P(1|2, R^*) = 0$, then
$R_1^*$ contains only points for which $p_2(x) = 0$. Then $P(2|1, R^*) = \Pr\{R_2^* \mid \pi_1\}
= \Pr\{p_2(x) > 0 \mid \pi_1\} = 1$, and $R^*$ is not better than $R$.


Theorem 6.3.2. If $\Pr\{p_2(x) = 0 \mid \pi_1\} = 0$ and $\Pr\{p_1(x) = 0 \mid \pi_2\} = 0$, then
every Bayes procedure is admissible.


Now let us prove the converse, namely, that every admissible procedure is
a Bayes procedure. We assume*

(14)  $\Pr\Bigl\{ \dfrac{p_1(x)}{p_2(x)} = k \Bigm| \pi_i \Bigr\} = 0, \qquad i = 1, 2, \quad 0 \le k \le \infty.$

Then for any $q_1$ the Bayes procedure is unique. Moreover, the cdf of
$p_1(x)/p_2(x)$ for $\pi_1$ and $\pi_2$ is continuous.

Let $R$ be an admissible procedure. Then there exists a $k$ such that

(15)  $P(2|1, R) = \Pr\Bigl\{ \dfrac{p_1(x)}{p_2(x)} \le k \Bigm| \pi_1 \Bigr\} = P(2|1, R^*),$

where $R^*$ is the Bayes procedure corresponding to $q_2/q_1 = k$ [i.e., $q_1 = 1/(1
+ k)$]. Since $R$ is admissible, $P(1|2, R) \le P(1|2, R^*)$. However, since by
Theorem 6.3.2 $R^*$ is admissible, $P(1|2, R) \ge P(1|2, R^*)$; that is, $P(1|2, R) =
P(1|2, R^*)$. Therefore, $R$ is also a Bayes procedure; by the uniqueness of
Bayes procedures $R$ is the same as $R^*$.


Theorem 6.3.3. Jf (14) holds, then every admissible procedure is a Bayes 
procedure, 


The proof of Theorem 6.3.3 shows that the class of Bayes procedures is
complete. For if $R$ is any procedure outside the class, we construct a Bayes
procedure $R^*$ so that $P(2|1, R) = P(2|1, R^*)$. Then, since $R^*$ is admissible,
$P(1|2, R) \ge P(1|2, R^*)$. Furthermore, the class of Bayes procedures is minimal
complete since it is identical with the class of admissible procedures.


*$p_1(x)/p_2(x) = \infty$ means $p_2(x) = 0$.




Theorem 6.3.4. If (14) holds, the class of Bayes procedures is minimal
complete.


Finally, let us consider the minimax procedure. Let $P(i|j, q_1) = P(i|j, R)$,
where $R$ is the Bayes procedure corresponding to $q_1$. $P(i|j, q_1)$ is a continuous
function of $q_1$. $P(2|1, q_1)$ varies from 1 to 0 as $q_1$ goes from 0 to 1;
$P(1|2, q_1)$ varies from 0 to 1. Thus there is a value of $q_1$, say $q_1^*$, such that
$P(2|1, q_1^*) = P(1|2, q_1^*)$. This is the minimax solution, for if there were
another procedure $R^*$ such that $\max\{P(2|1, R^*), P(1|2, R^*)\} < P(2|1, q_1^*) =
P(1|2, q_1^*)$, that would contradict the fact that every Bayes solution is admissible.


6.4. CLASSIFICATION INTO ONE OF TWO KNOWN MULTIVARIATE 
NORMAL POPULATIONS 


Now we shall use the general procedure outlined above in the case of two
multivariate normal populations with equal covariance matrices, namely,
$N(\mu^{(1)}, \Sigma)$ and $N(\mu^{(2)}, \Sigma)$, where $\mu^{(i)\prime} = (\mu_1^{(i)}, \ldots, \mu_p^{(i)})$ is the vector of means
of the $i$th population, $i = 1, 2$, and $\Sigma$ is the matrix of variances and covariances
of each population. [The approach was first used by Wald (1944).]
Then the $i$th density is

(1)    $p_i(x) = \dfrac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left[-\tfrac{1}{2}(x - \mu^{(i)})'\Sigma^{-1}(x - \mu^{(i)})\right].$

The ratio of densities is 


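(2)    $\dfrac{p_1(x)}{p_2(x)} = \dfrac{\exp\left[-\tfrac{1}{2}(x - \mu^{(1)})'\Sigma^{-1}(x - \mu^{(1)})\right]}{\exp\left[-\tfrac{1}{2}(x - \mu^{(2)})'\Sigma^{-1}(x - \mu^{(2)})\right]}$
        $= \exp\left\{-\tfrac{1}{2}\left[(x - \mu^{(1)})'\Sigma^{-1}(x - \mu^{(1)}) - (x - \mu^{(2)})'\Sigma^{-1}(x - \mu^{(2)})\right]\right\}.$

The region of classification into $\pi_1$, $R_1$, is the set of $x$'s for which (2) is
greater than or equal to $k$ (for $k$ suitably chosen). Since the logarithmic
function is monotonically increasing, the inequality can be written in terms of
the logarithm of (2) as

(3)    $-\tfrac{1}{2}\left[(x - \mu^{(1)})'\Sigma^{-1}(x - \mu^{(1)}) - (x - \mu^{(2)})'\Sigma^{-1}(x - \mu^{(2)})\right] \ge \log k.$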


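The left-hand side of (3) can be expanded as

(4)    $-\tfrac{1}{2}\left[x'\Sigma^{-1}x - x'\Sigma^{-1}\mu^{(1)} - \mu^{(1)\prime}\Sigma^{-1}x + \mu^{(1)\prime}\Sigma^{-1}\mu^{(1)}\right.$
        $\left. - x'\Sigma^{-1}x + x'\Sigma^{-1}\mu^{(2)} + \mu^{(2)\prime}\Sigma^{-1}x - \mu^{(2)\prime}\Sigma^{-1}\mu^{(2)}\right].$

By rearrangement of the terms we obtain

(5)    $x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}).$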


The first term is the well-known discriminant function. It is a function of the 
components of the observation vector. 


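The following theorem is now a direct consequence of Theorem 6.3.1.


Theorem 6.4.1. If $\pi_i$ has the density (1), $i = 1, 2$, the best regions of
classification are given by

(6)    $R_1$: $x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) \ge \log k,$
        $R_2$: $x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) < \log k.$

If a priori probabilities $q_1$ and $q_2$ are known, then $k$ is given by

(7)    $k = \dfrac{q_2 C(1|2)}{q_1 C(2|1)}.$


In the particular case of the two populations being equally likely and the
costs being equal, $k = 1$ and $\log k = 0$. Then the region of classification into
$\pi_1$ is

(8)    $R_1$: $x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) \ge \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}).$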
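The rule in (6) and (7) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dimension is fixed at $p = 2$, the parameter values are made up, and the helper names `solve_2x2`, `discriminant`, and `classify` are hypothetical, not notation from the text.

```python
import math

def solve_2x2(S, v):
    """Solve S d = v for a 2x2 matrix S by Cramer's rule."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [(d * v[0] - b * v[1]) / det, (a * v[1] - c * v[0]) / det]

def discriminant(x, mu1, mu2, Sigma):
    """Left-hand side of (6): [x - (mu1 + mu2)/2]' Sigma^{-1} (mu1 - mu2)."""
    delta = solve_2x2(Sigma, [mu1[0] - mu2[0], mu1[1] - mu2[1]])
    mid = [(mu1[0] + mu2[0]) / 2, (mu1[1] + mu2[1]) / 2]
    return sum(dk * (xk - mk) for dk, xk, mk in zip(delta, x, mid))

def classify(x, mu1, mu2, Sigma, q1=0.5, q2=0.5, c12=1.0, c21=1.0):
    """Return 1 if x lies in R_1 of (6), otherwise 2; k is taken from (7)."""
    k = (q2 * c12) / (q1 * c21)
    return 1 if discriminant(x, mu1, mu2, Sigma) >= math.log(k) else 2

# Illustrative (assumed) parameter values.
mu1, mu2 = [1.0, 0.0], [-1.0, 0.0]
Sigma = [[1.0, 0.5], [0.5, 2.0]]
print(classify([0.8, 0.1], mu1, mu2, Sigma))
```

With equal priors and equal costs the cutoff is $\log k = 0$, so each mean vector is assigned to its own population.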


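If we do not have a priori probabilities, we may select $\log k = c$, say, on the
basis of making the expected losses due to misclassification equal. Let $X$ be a
random observation. Then we wish to find the distribution of

(9)    $U = X'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$

on the assumption that $X$ is distributed according to $N(\mu^{(1)}, \Sigma)$ and then on
the assumption that $X$ is distributed according to $N(\mu^{(2)}, \Sigma)$. When $X$ is
distributed according to $N(\mu^{(1)}, \Sigma)$, $U$ is normally distributed with mean

(10)    $\mathcal{E}_1 U = \mu^{(1)\prime}\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$
        $= \tfrac{1}{2}(\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$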



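and variance

(11)    $\mathrm{Var}_1(U) = \mathcal{E}_1 (\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(X - \mu^{(1)})(X - \mu^{(1)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$
        $= (\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}).$

The Mahalanobis squared distance between $N(\mu^{(1)}, \Sigma)$ and $N(\mu^{(2)}, \Sigma)$ is

(12)    $(\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) = \Delta^2,$

say. Then $U$ is distributed according to $N(\tfrac{1}{2}\Delta^2, \Delta^2)$ if $X$ is distributed
according to $N(\mu^{(1)}, \Sigma)$. If $X$ is distributed according to $N(\mu^{(2)}, \Sigma)$, then

(13)    $\mathcal{E}_2 U = \mu^{(2)\prime}\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$
        $= \tfrac{1}{2}(\mu^{(2)} - \mu^{(1)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$
        $= -\tfrac{1}{2}\Delta^2.$

The variance is the same as when $X$ is distributed according to $N(\mu^{(1)}, \Sigma)$
because it depends only on the second-order moments of $X$. Thus $U$ is
distributed according to $N(-\tfrac{1}{2}\Delta^2, \Delta^2)$.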
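The probability of misclassification if the observation is from $\pi_1$ is

(14)    $P(2|1) = \displaystyle\int_{-\infty}^{c} \frac{1}{\sqrt{2\pi}\,\Delta}\, e^{-\frac{1}{2}(z - \frac{1}{2}\Delta^2)^2/\Delta^2}\, dz = \int_{-\infty}^{(c - \frac{1}{2}\Delta^2)/\Delta} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y^2}\, dy,$

and the probability of misclassification if the observation is from $\pi_2$ is

(15)    $P(1|2) = \displaystyle\int_{c}^{\infty} \frac{1}{\sqrt{2\pi}\,\Delta}\, e^{-\frac{1}{2}(z + \frac{1}{2}\Delta^2)^2/\Delta^2}\, dz = \int_{(c + \frac{1}{2}\Delta^2)/\Delta}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y^2}\, dy.$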


Figure 6.1 indicates the two probabilities as the shaded portions in the tails.

[Figure 6.1. Horizontal axis marked at $-\tfrac{1}{2}\Delta^2$, $0$, $c$, and $\tfrac{1}{2}\Delta^2$.]
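The two tail areas in (14) and (15) reduce to values of the standard normal cdf, which can be evaluated with the error function. The sketch below assumes a given cutoff $c$ and distance $\Delta > 0$; the function names `Phi` and `misclassification_probs` are hypothetical.

```python
import math

def Phi(z):
    """Standard normal cdf, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def misclassification_probs(c, Delta):
    """Return (P(2|1), P(1|2)) from (14) and (15) for cutoff c and Delta > 0."""
    p21 = Phi((c - 0.5 * Delta**2) / Delta)        # lower tail of N(Delta^2/2, Delta^2)
    p12 = 1.0 - Phi((c + 0.5 * Delta**2) / Delta)  # upper tail of N(-Delta^2/2, Delta^2)
    return p21, p12

p21, p12 = misclassification_probs(0.0, 2.0)
print(p21, p12)
```

For $c = 0$ the two probabilities coincide and equal the lower-tail probability at $-\tfrac{1}{2}\Delta$; raising $c$ increases $P(2|1)$ and decreases $P(1|2)$.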




For the minimax solution we choose $c$ so that

(16)    $C(1|2) \displaystyle\int_{(c + \frac{1}{2}\Delta^2)/\Delta}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y^2}\, dy = C(2|1) \int_{-\infty}^{(c - \frac{1}{2}\Delta^2)/\Delta} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y^2}\, dy.$


Theorem 6.4.2. If the $\pi_i$ have densities (1), $i = 1, 2$, the minimax regions of
classification are given by (6) where $c = \log k$ is chosen by the condition (16) with
$C(i|j)$ the two costs of misclassification.


It should be noted that if the costs of misclassification are equal, $c = 0$ and
the probability of misclassification is

(17)    $\displaystyle\int_{\Delta/2}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y^2}\, dy.$


In case the costs of misclassification are unequal, c could be determined to 
sufficient accuracy by a trial-and-error method with the normal tables. 
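In place of trial and error with tables, condition (16) can be solved numerically. The sketch below uses bisection, which is a substitute technique chosen here for illustration and not a method named in the text; the difference of the two sides of (16) is decreasing in $c$, so a sign change brackets the root. The names `Phi` and `minimax_cutoff` are hypothetical.

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def minimax_cutoff(Delta, c12=1.0, c21=1.0, tol=1e-12):
    """Solve (16) for c by bisection; c12 = C(1|2), c21 = C(2|1)."""
    def gap(c):
        # Left side of (16) minus right side; strictly decreasing in c.
        left = c12 * (1.0 - Phi((c + 0.5 * Delta**2) / Delta))
        right = c21 * Phi((c - 0.5 * Delta**2) / Delta)
        return left - right
    lo = -0.5 * Delta**2 - 10.0 * Delta   # gap(lo) > 0
    hi = 0.5 * Delta**2 + 10.0 * Delta    # gap(hi) < 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(minimax_cutoff(2.0))            # equal costs
print(minimax_cutoff(2.0, c12=2.0))   # misclassifying a pi_2 observation costs more
```

With equal costs the root is $c = 0$, in agreement with the remark preceding (17); making $C(1|2)$ larger moves the cutoff upward so that $P(1|2)$ shrinks.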
Both terms in (5) involve the vector

(18)    $\delta = \Sigma^{-1}(\mu^{(1)} - \mu^{(2)}).$

This is obtained as the solution of

(19)    $\Sigma\delta = (\mu^{(1)} - \mu^{(2)})$

by an efficient computing method. The discriminant function $x'\delta$ is the linear
function that maximizes

(20)    $\dfrac{[\mathcal{E}_1(X'd) - \mathcal{E}_2(X'd)]^2}{\mathrm{Var}(X'd)}$

for all choices of $d$. The numerator of (20) is

(21)    $[\mu^{(1)\prime}d - \mu^{(2)\prime}d]^2 = d'[(\mu^{(1)} - \mu^{(2)})(\mu^{(1)} - \mu^{(2)})']d;$

the denominator is

(22)    $d'\mathcal{E}(X - \mathcal{E}X)(X - \mathcal{E}X)'d = d'\Sigma d.$

We wish to maximize (21) with respect to $d$, holding (22) constant. If $\lambda$ is a
Lagrange multiplier, we ask for the maximum of

(23)    $d'[(\mu^{(1)} - \mu^{(2)})(\mu^{(1)} - \mu^{(2)})']d - \lambda(d'\Sigma d - 1).$




The derivatives of (23) with respect to the components of $d$ are set equal to
zero to obtain

(24)    $2[(\mu^{(1)} - \mu^{(2)})(\mu^{(1)} - \mu^{(2)})']d = 2\lambda\Sigma d.$

Since $(\mu^{(1)} - \mu^{(2)})'d$ is a scalar, say $\nu$, we can write (24) as

(25)    $\mu^{(1)} - \mu^{(2)} = \dfrac{\lambda}{\nu}\Sigma d.$

Thus the solution is proportional to $\delta$.

We may finally note that if we have a sample of $N$ from either $\pi_1$ or $\pi_2$,
we use the mean of the sample and classify it as from $N[\mu^{(1)}, (1/N)\Sigma]$ or
$N[\mu^{(2)}, (1/N)\Sigma]$.
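The maximizing property of the discriminant direction can be checked numerically. The sketch below assumes $p = 2$ and made-up values for $\mu^{(1)} - \mu^{(2)}$ and $\Sigma$; it evaluates the ratio (20) at the solution of (19) and at a grid of unit vectors. The names `solve_2x2` and `ratio` are hypothetical.

```python
import math

def solve_2x2(S, v):
    """Solve S d = v for a 2x2 matrix S by Cramer's rule."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [(d * v[0] - b * v[1]) / det, (a * v[1] - c * v[0]) / det]

def ratio(d, diff, Sigma):
    """The criterion (20): (d' diff)^2 / (d' Sigma d), with diff = mu1 - mu2."""
    num = (d[0] * diff[0] + d[1] * diff[1]) ** 2
    den = sum(d[i] * Sigma[i][j] * d[j] for i in range(2) for j in range(2))
    return num / den

diff = [2.0, 0.0]                     # assumed mu1 - mu2
Sigma = [[1.0, 0.5], [0.5, 2.0]]      # assumed covariance matrix
delta = solve_2x2(Sigma, diff)        # solution of (19)
best = ratio(delta, diff, Sigma)      # equals the squared distance (12)
grid = [ratio([math.cos(t), math.sin(t)], diff, Sigma)
        for t in (k * math.pi / 180.0 for k in range(180))]
print(best, max(grid))
```

The value of (20) at the solution of (19) equals the Mahalanobis squared distance, and no direction on the grid exceeds it. Because (20) is unchanged when $d$ is rescaled, only the direction matters, which is why the solution is determined up to proportionality.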


6.5. CLASSIFICATION INTO ONE OF TWO MULTIVARIATE NORMAL 
POPULATIONS WHEN THE PARAMETERS ARE ESTIMATED 


6.5.1. The Criterion of Classification


Thus far we have assumed that the two populations are known exactly. In
most applications of this theory the populations are not known, but must be
inferred from samples, one from each population. We shall now treat the
case in which we have a sample from each of two normal populations and we
wish to use that information in classifying another observation as coming
from one of the two populations.

Suppose that we have a sample $x_1^{(1)}, \ldots, x_{N_1}^{(1)}$ from $N(\mu^{(1)}, \Sigma)$ and a sample
$x_1^{(2)}, \ldots, x_{N_2}^{(2)}$ from $N(\mu^{(2)}, \Sigma)$. In one terminology these are "training samples."
On the basis of this information we wish to classify the observation $x$ as
coming from $\pi_1$ or $\pi_2$. Clearly, our best estimate of $\mu^{(1)}$ is $\bar{x}^{(1)} = \sum_{\alpha=1}^{N_1} x_\alpha^{(1)}/N_1$,
of $\mu^{(2)}$ is $\bar{x}^{(2)} = \sum_{\alpha=1}^{N_2} x_\alpha^{(2)}/N_2$, and of $\Sigma$ is $S$ defined by

(1)    $(N_1 + N_2 - 2)S = \displaystyle\sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \bar{x}^{(1)})(x_\alpha^{(1)} - \bar{x}^{(1)})' + \sum_{\alpha=1}^{N_2} (x_\alpha^{(2)} - \bar{x}^{(2)})(x_\alpha^{(2)} - \bar{x}^{(2)})'.$

We substitute these estimates for the parameters in (5) of Section 6.4 to
obtain

(2)    $W(x) = x'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)}) - \tfrac{1}{2}(\bar{x}^{(1)} + \bar{x}^{(2)})'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)}).$




The first term of (2) is the discriminant function based on two samples 
[suggested by Fisher (1936)]. It is the linear function that has greatest 
variance between samples relative to the variance within samples (Problem 
6.12). We propose that (2) be used as the criterion of classification in the 
same way that (5) of Section 6.4 is used. 

When the populations are known, we can argue that the classification 
criterion is the best in the sense that its use minimizes the expected loss in 
the case of known a priori probabilities and generates the class of admissible 
procedures when a priori probabilities are not known. We cannot justify the 
use of (2) in the same way. However, it seems intuitively reasonable that (2) 
should give good results. Another criterion is indicated in Section 6.5.5. 
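A minimal sketch of the sample criterion $W(x)$ follows, built from the two sample means and the pooled covariance estimate $S$. It assumes $p = 2$ and two tiny made-up training samples; the function names (`mean`, `scatter`, `pooled_cov`, `W`, `solve_2x2`) are hypothetical.

```python
def solve_2x2(S, v):
    """Solve S d = v for a 2x2 matrix S by Cramer's rule."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [(d * v[0] - b * v[1]) / det, (a * v[1] - c * v[0]) / det]

def mean(rows):
    """Sample mean vector of a list of 2-vectors."""
    return [sum(r[k] for r in rows) / len(rows) for k in range(2)]

def scatter(rows, m):
    """Sum of (x - m)(x - m)' over the rows."""
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows)
             for j in range(2)] for i in range(2)]

def pooled_cov(s1, s2):
    """Pooled estimate S with N1 + N2 - 2 degrees of freedom."""
    a1, a2 = scatter(s1, mean(s1)), scatter(s2, mean(s2))
    n = len(s1) + len(s2) - 2
    return [[(a1[i][j] + a2[i][j]) / n for j in range(2)] for i in range(2)]

def W(x, s1, s2):
    """Sample criterion: [x - (xbar1 + xbar2)/2]' S^{-1} (xbar1 - xbar2)."""
    m1, m2 = mean(s1), mean(s2)
    d = solve_2x2(pooled_cov(s1, s2), [m1[0] - m2[0], m1[1] - m2[1]])
    return sum(d[k] * (x[k] - 0.5 * (m1[k] + m2[k])) for k in range(2))

# Made-up training samples (assumptions for illustration only).
sample1 = [[2.0, 1.0], [3.0, 2.0], [4.0, 1.0], [3.0, 0.0]]
sample2 = [[-1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, -1.0]]
print(W([1.0, 1.0], sample1, sample2))
```

Evaluated at the first sample mean, $W$ equals half the sample Mahalanobis squared distance between the two means, and at the second sample mean it equals the negative of that value.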

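Suppose we have a sample $x_1, \ldots, x_N$ from either $\pi_1$ or $\pi_2$, and we wish
to classify the sample as a whole. Then we define $S$ by

(3)    $(N_1 + N_2 + N - 3)S = \displaystyle\sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \bar{x}^{(1)})(x_\alpha^{(1)} - \bar{x}^{(1)})'$
        $+ \displaystyle\sum_{\alpha=1}^{N_2} (x_\alpha^{(2)} - \bar{x}^{(2)})(x_\alpha^{(2)} - \bar{x}^{(2)})' + \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})',$

where

(4)    $\bar{x} = \dfrac{1}{N} \displaystyle\sum_{\alpha=1}^{N} x_\alpha.$

Then the criterion is

(5)    $[\bar{x} - \tfrac{1}{2}(\bar{x}^{(1)} + \bar{x}^{(2)})]'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)}).$

The larger $N$ is, the smaller are the probabilities of misclassification.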


6.5.2. On the Distribution of the Criterion 
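Let

(6)    $W = X'S^{-1}(\bar{X}^{(1)} - \bar{X}^{(2)}) - \tfrac{1}{2}(\bar{X}^{(1)} + \bar{X}^{(2)})'S^{-1}(\bar{X}^{(1)} - \bar{X}^{(2)})$
        $= [X - \tfrac{1}{2}(\bar{X}^{(1)} + \bar{X}^{(2)})]'S^{-1}(\bar{X}^{(1)} - \bar{X}^{(2)})$

for random $X$, $\bar{X}^{(1)}$, $\bar{X}^{(2)}$, and $S$.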




The distribution of $W$ is extremely complicated. It depends on the sample
sizes and the unknown $\Delta^2$. Let

(7)    $Y_1 = c_1[X - (N_1 + N_2)^{-1}(N_1\bar{X}^{(1)} + N_2\bar{X}^{(2)})],$

(8)    $Y_2 = c_2(\bar{X}^{(1)} - \bar{X}^{(2)}),$

where $c_1 = \sqrt{(N_1 + N_2)/(N_1 + N_2 + 1)}$ and $c_2 = \sqrt{N_1 N_2/(N_1 + N_2)}$. Then
$Y_1$ and $Y_2$ are independently normally distributed with covariance matrix $\Sigma$.
The expected value of $Y_2$ is $c_2(\mu^{(1)} - \mu^{(2)})$, and the expected value of $Y_1$ is
$c_1[N_2/(N_1 + N_2)](\mu^{(1)} - \mu^{(2)})$ if $X$ is from $\pi_1$ and $-c_1[N_1/(N_1 + N_2)](\mu^{(1)} - \mu^{(2)})$
if $X$ is from $\pi_2$. Let $Y = (Y_1\ Y_2)$ and

(9)    $M = Y'S^{-1}Y = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix}.$

Then

(10)    $W = \sqrt{\dfrac{N_1 + N_2 + 1}{N_1 N_2}}\; m_{12} + \dfrac{N_1 - N_2}{2 N_1 N_2}\; m_{22}.$

The density of $M$ has been given by Sitgreaves (1952). Anderson (1951a) and
Wald (1944) have also studied the distribution of $W$.

If $N_1 = N_2$, the distribution of $W$ for $X$ from $\pi_1$ is the same as that of
$-W$ for $X$ from $\pi_2$. Thus, if $W \ge 0$ is the region of classification as $\pi_1$, then
the probability of misclassifying $X$ when it is from $\pi_1$ is equal to the
probability of misclassifying it when it is from $\pi_2$.


6.5.3. The Asymptotic Distribution of the Criterion 


In the case of large samples from $N(\mu^{(1)}, \Sigma)$ and $N(\mu^{(2)}, \Sigma)$, we can apply
limiting distribution theory. Since $\bar{X}^{(1)}$ is the mean of a sample of $N_1$
independent observations from $N(\mu^{(1)}, \Sigma)$, we know that

(11)    $\underset{N_1 \to \infty}{\mathrm{plim}}\ \bar{X}^{(1)} = \mu^{(1)}.$

The explicit definition of (11) is as follows: Given arbitrary positive $\delta$ and $\varepsilon$,
we can find $N$ large enough so that for $N_1 \ge N$

(12)    $\Pr\{|\bar{X}_i^{(1)} - \mu_i^{(1)}| < \delta,\ i = 1, \ldots, p\} > 1 - \varepsilon.$




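(See Problem 3.23.) This can be proved by using the Tchebycheff inequality.
Similarly,

(13)    $\underset{N_2 \to \infty}{\mathrm{plim}}\ \bar{X}^{(2)} = \mu^{(2)},$

and

(14)    $\mathrm{plim}\ S = \Sigma$

as $N_1 \to \infty$, $N_2 \to \infty$, or as both $N_1, N_2 \to \infty$. From (14) we obtain

(15)    $\mathrm{plim}\ S^{-1} = \Sigma^{-1},$

since the probability limits of sums, differences, products, and quotients of
random variables are the sums, differences, products, and quotients of their
probability limits as long as the probability limit of each denominator is
different from zero [Cramér (1946), p. 254]. Furthermore,

(16)    $\underset{N_1, N_2 \to \infty}{\mathrm{plim}}\ S^{-1}(\bar{X}^{(1)} - \bar{X}^{(2)}) = \Sigma^{-1}(\mu^{(1)} - \mu^{(2)}),$

(17)    $\underset{N_1, N_2 \to \infty}{\mathrm{plim}}\ (\bar{X}^{(1)} + \bar{X}^{(2)})'S^{-1}(\bar{X}^{(1)} - \bar{X}^{(2)}) = (\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}).$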


It follows then that the limiting distribution of W is the distribution of U. 
For sufficiently large samples from 7, and 7, we can use the criterion as if 
we knew the population exactly and make only a small error. [The result was 
first given by Wald (1944). ] 


Theorem 6.5.1. Let $W$ be given by (6) with $\bar{X}^{(1)}$ the mean of a sample of $N_1$
from $N(\mu^{(1)}, \Sigma)$, $\bar{X}^{(2)}$ the mean of a sample of $N_2$ from $N(\mu^{(2)}, \Sigma)$, and $S$ the
estimate of $\Sigma$ based on the pooled sample. The limiting distribution of $W$ as
$N_1 \to \infty$ and $N_2 \to \infty$ is $N(\tfrac{1}{2}\Delta^2, \Delta^2)$ if $X$ is distributed according to $N(\mu^{(1)}, \Sigma)$
and is $N(-\tfrac{1}{2}\Delta^2, \Delta^2)$ if $X$ is distributed according to $N(\mu^{(2)}, \Sigma)$.


6.5.4. Another Derivation of the Criterion 


A convenient mnemonic derivation of the criterion is the use of regression of 
a dummy variate [given by Fisher (1936)]. Let 


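(18)    $y_\alpha^{(1)} = \dfrac{N_2}{N_1 + N_2}, \quad \alpha = 1, \ldots, N_1; \qquad y_\alpha^{(2)} = \dfrac{-N_1}{N_1 + N_2}, \quad \alpha = 1, \ldots, N_2.$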




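Then formally find the regression on the variates $x_\alpha^{(i)}$ by choosing $b$ to
minimize

(19)    $\displaystyle\sum_{i=1}^{2} \sum_{\alpha=1}^{N_i} [y_\alpha^{(i)} - b'(x_\alpha^{(i)} - \bar{x})]^2,$

where

(20)    $\bar{x} = \dfrac{N_1\bar{x}^{(1)} + N_2\bar{x}^{(2)}}{N_1 + N_2}.$

The normal equations are

(21)    $\displaystyle\sum_{i=1}^{2} \sum_{\alpha=1}^{N_i} (x_\alpha^{(i)} - \bar{x})(x_\alpha^{(i)} - \bar{x})'b = \sum_{i=1}^{2} \sum_{\alpha=1}^{N_i} y_\alpha^{(i)}(x_\alpha^{(i)} - \bar{x})$
        $= \dfrac{N_1 N_2}{N_1 + N_2}\,[(\bar{x}^{(1)} - \bar{x}) - (\bar{x}^{(2)} - \bar{x})]$
        $= \dfrac{N_1 N_2}{N_1 + N_2}\,(\bar{x}^{(1)} - \bar{x}^{(2)}).$

The matrix multiplying $b$ can be written as

(22)    $\displaystyle\sum_{i=1}^{2} \sum_{\alpha=1}^{N_i} (x_\alpha^{(i)} - \bar{x})(x_\alpha^{(i)} - \bar{x})'$
        $= \displaystyle\sum_{i=1}^{2} \sum_{\alpha=1}^{N_i} (x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(i)} - \bar{x}^{(i)})' + N_1(\bar{x}^{(1)} - \bar{x})(\bar{x}^{(1)} - \bar{x})' + N_2(\bar{x}^{(2)} - \bar{x})(\bar{x}^{(2)} - \bar{x})'$
        $= \displaystyle\sum_{i=1}^{2} \sum_{\alpha=1}^{N_i} (x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(i)} - \bar{x}^{(i)})' + \dfrac{N_1 N_2}{N_1 + N_2}\,(\bar{x}^{(1)} - \bar{x}^{(2)})(\bar{x}^{(1)} - \bar{x}^{(2)})'.$

Thus (21) can be written as

(23)    $Ab = (\bar{x}^{(1)} - \bar{x}^{(2)})\left[\dfrac{N_1 N_2}{N_1 + N_2} - \dfrac{N_1 N_2}{N_1 + N_2}\,(\bar{x}^{(1)} - \bar{x}^{(2)})'b\right],$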




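where

(24)    $A = \displaystyle\sum_{i=1}^{2} \sum_{\alpha=1}^{N_i} (x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(i)} - \bar{x}^{(i)})'.$

Since $(\bar{x}^{(1)} - \bar{x}^{(2)})'b$ is a scalar, we see that the solution $b$ of (23) is
proportional to $S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$.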


6.5.5. The Likelihood Ratio Criterion 

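Another criterion which can be used in classification is the likelihood ratio
criterion. Consider testing the composite null hypothesis that $x, x_1^{(1)}, \ldots, x_{N_1}^{(1)}$
are drawn from $N(\mu^{(1)}, \Sigma)$ and $x_1^{(2)}, \ldots, x_{N_2}^{(2)}$ are drawn from $N(\mu^{(2)}, \Sigma)$
against the composite alternative hypothesis that $x_1^{(1)}, \ldots, x_{N_1}^{(1)}$ are drawn from
$N(\mu^{(1)}, \Sigma)$ and $x, x_1^{(2)}, \ldots, x_{N_2}^{(2)}$ are drawn from $N(\mu^{(2)}, \Sigma)$, with $\mu^{(1)}$, $\mu^{(2)}$, and
$\Sigma$ unspecified. Under the first hypothesis the maximum likelihood estimators
of $\mu^{(1)}$, $\mu^{(2)}$, and $\Sigma$ are

(25)    $\hat{\mu}_1^{(1)} = \dfrac{N_1\bar{x}^{(1)} + x}{N_1 + 1},$
        $\hat{\mu}_1^{(2)} = \bar{x}^{(2)},$
        $\hat{\Sigma}_1 = \dfrac{1}{N_1 + N_2 + 1}\left[\displaystyle\sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \hat{\mu}_1^{(1)})(x_\alpha^{(1)} - \hat{\mu}_1^{(1)})' + (x - \hat{\mu}_1^{(1)})(x - \hat{\mu}_1^{(1)})' + \sum_{\alpha=1}^{N_2} (x_\alpha^{(2)} - \hat{\mu}_1^{(2)})(x_\alpha^{(2)} - \hat{\mu}_1^{(2)})'\right].$

Since

(26)    $\displaystyle\sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \hat{\mu}_1^{(1)})(x_\alpha^{(1)} - \hat{\mu}_1^{(1)})' + (x - \hat{\mu}_1^{(1)})(x - \hat{\mu}_1^{(1)})'$
        $= \displaystyle\sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \bar{x}^{(1)})(x_\alpha^{(1)} - \bar{x}^{(1)})' + N_1(\bar{x}^{(1)} - \hat{\mu}_1^{(1)})(\bar{x}^{(1)} - \hat{\mu}_1^{(1)})' + (x - \hat{\mu}_1^{(1)})(x - \hat{\mu}_1^{(1)})'$
        $= \displaystyle\sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \bar{x}^{(1)})(x_\alpha^{(1)} - \bar{x}^{(1)})' + \dfrac{N_1}{N_1 + 1}\,(x - \bar{x}^{(1)})(x - \bar{x}^{(1)})',$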



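we can write $\hat{\Sigma}_1$ as

(27)    $\hat{\Sigma}_1 = \dfrac{1}{N_1 + N_2 + 1}\left[A + \dfrac{N_1}{N_1 + 1}\,(x - \bar{x}^{(1)})(x - \bar{x}^{(1)})'\right],$

where $A$ is given by (24). Under the assumptions of the alternative hypothesis
we find (by considerations of symmetry) that the maximum likelihood estimators
of the parameters are

(28)    $\hat{\mu}_2^{(1)} = \bar{x}^{(1)},$
        $\hat{\mu}_2^{(2)} = \dfrac{N_2\bar{x}^{(2)} + x}{N_2 + 1},$
        $\hat{\Sigma}_2 = \dfrac{1}{N_1 + N_2 + 1}\left[A + \dfrac{N_2}{N_2 + 1}\,(x - \bar{x}^{(2)})(x - \bar{x}^{(2)})'\right].$

The likelihood ratio criterion is, therefore, the $(N_1 + N_2 + 1)/2$th power of

(29)    $\dfrac{|\hat{\Sigma}_2|}{|\hat{\Sigma}_1|} = \dfrac{\left|A + \frac{N_2}{N_2 + 1}(x - \bar{x}^{(2)})(x - \bar{x}^{(2)})'\right|}{\left|A + \frac{N_1}{N_1 + 1}(x - \bar{x}^{(1)})(x - \bar{x}^{(1)})'\right|}.$

This ratio can also be written (Corollary A.3.1)

(30)    $\dfrac{1 + \frac{N_2}{N_2 + 1}(x - \bar{x}^{(2)})'A^{-1}(x - \bar{x}^{(2)})}{1 + \frac{N_1}{N_1 + 1}(x - \bar{x}^{(1)})'A^{-1}(x - \bar{x}^{(1)})} = \dfrac{n + \frac{N_2}{N_2 + 1}(x - \bar{x}^{(2)})'S^{-1}(x - \bar{x}^{(2)})}{n + \frac{N_1}{N_1 + 1}(x - \bar{x}^{(1)})'S^{-1}(x - \bar{x}^{(1)})},$

where $n = N_1 + N_2 - 2$. The region of classification into $\pi_1$ consists of those
points for which the ratio (30) is greater than or equal to a given number $K_n$.
It can be written

(31)    $R_1$: $n + \dfrac{N_2}{N_2 + 1}(x - \bar{x}^{(2)})'S^{-1}(x - \bar{x}^{(2)}) \ge K_n\left[n + \dfrac{N_1}{N_1 + 1}(x - \bar{x}^{(1)})'S^{-1}(x - \bar{x}^{(1)})\right].$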




If $K_n = 1 + 2c/n$ and $N_1$ and $N_2$ are large, the region (31) is approximately
$W(x) \ge c$.

If we take $K_n = 1$, the rule is to classify as $\pi_1$ if (30) is greater than 1 and
as $\pi_2$ if (30) is less than 1. This is the maximum likelihood rule. Let

(32)    $Z = \dfrac{1}{2}\left[\dfrac{N_2}{N_2 + 1}(x - \bar{x}^{(2)})'S^{-1}(x - \bar{x}^{(2)}) - \dfrac{N_1}{N_1 + 1}(x - \bar{x}^{(1)})'S^{-1}(x - \bar{x}^{(1)})\right].$

Then the maximum likelihood rule is to classify as $\pi_1$ if $Z > 0$ and $\pi_2$ if
$Z < 0$. Roughly speaking, assign $x$ to $\pi_1$ or $\pi_2$ according to whether the
distance to $\bar{x}^{(1)}$ is less or greater than the distance to $\bar{x}^{(2)}$. The difference
between $W$ and $Z$ is

(33)    $W - Z = \dfrac{1}{2}\left[\dfrac{1}{N_2 + 1}(x - \bar{x}^{(2)})'S^{-1}(x - \bar{x}^{(2)}) - \dfrac{1}{N_1 + 1}(x - \bar{x}^{(1)})'S^{-1}(x - \bar{x}^{(1)})\right],$

which has the probability limit 0 as $N_1, N_2 \to \infty$. The probabilities of
misclassification with $W$ are equivalent asymptotically to those with $Z$ for large
samples.

Note that for $N_1 = N_2$, $Z = [N_1/(N_1 + 1)]W$. Then the symmetric test
based on the cutoff $c = 0$ is the same for $Z$ and $W$.
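The relation between $W$ and $Z$ for equal sample sizes can be checked on a small example. The sketch below assumes $p = 2$ and made-up training samples with $N_1 = N_2 = 4$; it computes $W$ as half the difference of the two squared distances and $Z$ from (32). All function names are hypothetical.

```python
def solve_2x2(S, v):
    """Solve S d = v for a 2x2 matrix S by Cramer's rule."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [(d * v[0] - b * v[1]) / det, (a * v[1] - c * v[0]) / det]

def mean(rows):
    return [sum(r[k] for r in rows) / len(rows) for k in range(2)]

def pooled_cov(s1, s2):
    """Pooled covariance estimate S with N1 + N2 - 2 degrees of freedom."""
    out = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (s1, s2):
        m = mean(rows)
        for r in rows:
            for i in range(2):
                for j in range(2):
                    out[i][j] += (r[i] - m[i]) * (r[j] - m[j])
    n = len(s1) + len(s2) - 2
    return [[out[i][j] / n for j in range(2)] for i in range(2)]

def sq_dist(x, m, S):
    """(x - m)' S^{-1} (x - m)."""
    v = [x[0] - m[0], x[1] - m[1]]
    w = solve_2x2(S, v)
    return v[0] * w[0] + v[1] * w[1]

def W_and_Z(x, s1, s2):
    n1, n2 = len(s1), len(s2)
    S = pooled_cov(s1, s2)
    d1, d2 = sq_dist(x, mean(s1), S), sq_dist(x, mean(s2), S)
    W = 0.5 * (d2 - d1)
    Z = 0.5 * (n2 / (n2 + 1) * d2 - n1 / (n1 + 1) * d1)   # equation (32)
    return W, Z

sample1 = [[2.0, 1.0], [3.0, 2.0], [4.0, 1.0], [3.0, 0.0]]
sample2 = [[-1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, -1.0]]
print(W_and_Z([1.0, 1.0], sample1, sample2))
```

Since both samples have four observations, $Z$ equals $\tfrac{4}{5}W$ at every point, so the two statistics always have the same sign and the rules with cutoff 0 agree.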


6.5.6. Invariance 


The classification problem is invariant with respect to transformations 


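(34)    $x_\alpha^{(1)*} = Bx_\alpha^{(1)} + c, \quad \alpha = 1, \ldots, N_1,$
        $x_\alpha^{(2)*} = Bx_\alpha^{(2)} + c, \quad \alpha = 1, \ldots, N_2,$
        $x^* = Bx + c,$

where $B$ is nonsingular and $c$ is a vector. This transformation induces the
following transformation on the sufficient statistics:

(35)    $\bar{x}^{(1)*} = B\bar{x}^{(1)} + c, \qquad \bar{x}^{(2)*} = B\bar{x}^{(2)} + c,$
        $x^* = Bx + c, \qquad S^* = BSB',$

with the same transformations on the parameters, $\mu^{(1)}$, $\mu^{(2)}$, and $\Sigma$. (Note
that $\mathcal{E}x = \mu^{(1)}$ or $\mu^{(2)}$.) Any invariant of the parameters is a function of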




$\Delta^2 = (\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$. There exists a matrix $B$ and a vector $c$
such that

(36)    $\mu^{(1)*} = B\mu^{(1)} + c = 0, \qquad \mu^{(2)*} = B\mu^{(2)} + c = (\Delta, 0, \ldots, 0)',$
        $\Sigma^* = B\Sigma B' = I.$

Therefore, $\Delta^2$ is the minimal invariant of the parameters. The elements of $M$
defined by (9) are invariant and are the minimal invariants of the sufficient
statistics. Thus invariant procedures depend on $M$, and the distribution of $M$
depends only on $\Delta^2$. The statistics $W$ and $Z$ are invariant.


6.6. PROBABILITIES OF MISCLASSIFICATION 


6.6.1. Asymptotic Expansions of the Probabilities of Misclassification 
Using W 


We may want to know the probabilities of misclassification before we draw
the two samples for determining the classification rule, and we may want to
know the (conditional) probabilities of misclassification after drawing the
samples. As observed earlier, the exact distributions of $W$ and $Z$ are very
difficult to calculate. Therefore, we treat asymptotic expansions of their
probabilities as $N_1$ and $N_2$ increase. The background is that the limiting
distribution of $W$ and $Z$ is $N(\tfrac{1}{2}\Delta^2, \Delta^2)$ if $x$ is from $\pi_1$ and is $N(-\tfrac{1}{2}\Delta^2, \Delta^2)$ if
$x$ is from $\pi_2$.

Okamoto (1963) obtained the asymptotic expansion of the distribution of
$W$ to terms of order $n^{-2}$, and Siotani and Wang (1975, 1977) to terms of
order $n^{-3}$. [Bowker and Sitgreaves (1961) treated the case of $N_1 = N_2$.] Let
$\Phi(\cdot)$ and $\phi(\cdot)$ be the cdf and density of $N(0, 1)$, respectively.


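Theorem 6.6.1. As $N_1 \to \infty$, $N_2 \to \infty$, and $N_1/N_2 \to$ a positive limit ($n = N_1 + N_2 - 2$),

(1)    $\Pr\left\{\dfrac{W - \frac{1}{2}\Delta^2}{\Delta} \le u \,\Big|\, \pi_1\right\}$
        $= \Phi(u) - \phi(u)\left\{\dfrac{1}{2N_1\Delta^2}\left[u^3 + (p - 3)u - p\Delta\right]\right.$
        $+ \dfrac{1}{2N_2\Delta^2}\left[u^3 + 2\Delta u^2 + (p - 3 + \Delta^2)u + (p - 2)\Delta\right]$
        $\left. + \dfrac{1}{4n}\left[4u^3 + 4\Delta u^2 + (6p - 6 + \Delta^2)u + 2(p - 1)\Delta\right]\right\} + O(n^{-2}),$

and $\Pr\{-(W + \tfrac{1}{2}\Delta^2)/\Delta \le u | \pi_2\}$ is (1) with $N_1$ and $N_2$ interchanged.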




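The rule using $W$ is to assign the observation $x$ to $\pi_1$ if $W(x) > c$ and to
$\pi_2$ if $W(x) \le c$. The probabilities of misclassification are given by Theorem
6.6.1 with $u = (c - \tfrac{1}{2}\Delta^2)/\Delta$ and $u = -(c + \tfrac{1}{2}\Delta^2)/\Delta$, respectively. For $c = 0$,
$u = -\tfrac{1}{2}\Delta$. If $N_1 = N_2$, this defines an exact minimax procedure [Das Gupta
(1965)].


Corollary 6.6.1

(2)    $\Pr\left\{W \le 0 \,\Big|\, \pi_1, \lim_{n \to \infty} \dfrac{N_1}{N_2} = 1\right\}$
        $= \Phi(-\tfrac{1}{2}\Delta) + \dfrac{1}{n}\,\phi(\tfrac{1}{2}\Delta)\left[\dfrac{p - 1}{\Delta} + \dfrac{p}{4}\Delta\right] + o(n^{-1})$
        $= \Pr\left\{W \ge 0 \,\Big|\, \pi_2, \lim_{n \to \infty} \dfrac{N_1}{N_2} = 1\right\}.$


Note that the correction term is positive, as far as this correction goes;
that is, the probability of misclassification is greater than the value of the
normal approximation. The correction term (to order $n^{-1}$) increases with $p$
for given $\Delta$ and decreases with $\Delta$ for given $p$.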
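The expansion of Theorem 6.6.1 and the approximation of Corollary 6.6.1 can be evaluated directly. The sketch below drops the remainder terms, so it returns approximations only; the function names are hypothetical, and `n` is exposed as an optional argument purely so that the equal-sample-size limit ($N_1 = N_2 = N$ with $n$ replaced by $2N$) can be compared with the corollary.

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def okamoto_cdf(u, Delta, p, N1, N2, n=None):
    """Expansion (1) of Theorem 6.6.1 with the O(n^-2) remainder omitted."""
    if n is None:
        n = N1 + N2 - 2
    t1 = (u**3 + (p - 3) * u - p * Delta) / (2 * N1 * Delta**2)
    t2 = (u**3 + 2 * Delta * u**2 + (p - 3 + Delta**2) * u
          + (p - 2) * Delta) / (2 * N2 * Delta**2)
    t3 = (4 * u**3 + 4 * Delta * u**2 + (6 * p - 6 + Delta**2) * u
          + 2 * (p - 1) * Delta) / (4 * n)
    return Phi(u) - phi(u) * (t1 + t2 + t3)

def corollary_error(Delta, p, n):
    """Approximation (2) of Corollary 6.6.1 with the o(n^-1) remainder omitted."""
    return Phi(-0.5 * Delta) + phi(0.5 * Delta) * ((p - 1) / Delta + 0.25 * p * Delta) / n

print(okamoto_cdf(-1.0, 2.0, 3, 25, 25, n=50), corollary_error(2.0, 3, 50))
```

Setting $u = -\tfrac{1}{2}\Delta$ in (1) with equal sample sizes collapses the three bracketed terms into the single correction of (2), which the two printed values illustrate. The correction is positive and grows with $p$.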


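Since $\Delta$ is usually unknown, it is relevant to Studentize $W$. The sample
Mahalanobis squared distance

(3)    $D^2 = (\bar{X}^{(1)} - \bar{X}^{(2)})'S^{-1}(\bar{X}^{(1)} - \bar{X}^{(2)})$

is an estimator of the population Mahalanobis squared distance $\Delta^2$. The
expectation of $D^2$ is

(4)    $\mathcal{E}D^2 = \dfrac{n}{n - p - 1}\left[\Delta^2 + p\left(\dfrac{1}{N_1} + \dfrac{1}{N_2}\right)\right].$

See Problem 6.14. If $N_1$ and $N_2$ are large, this is approximately $\Delta^2$.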
Anderson (1973b) showed the following: 


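Theorem 6.6.2. If $N_1/N_2 \to$ a positive limit as $n \to \infty$,

(5)    $\Pr\left\{\dfrac{W - \frac{1}{2}D^2}{D} \le u \,\Big|\, \pi_1\right\}$
        $= \Phi(u) - \phi(u)\left\{\dfrac{1}{N_1}\left(\dfrac{u}{2} - \dfrac{p - 1}{\Delta}\right) + \dfrac{1}{n}\left[\dfrac{u^3}{4} + \left(p - \dfrac{3}{4}\right)u\right]\right\} + O(n^{-2}),$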




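(6)    $\Pr\left\{-\dfrac{W + \frac{1}{2}D^2}{D} \le u \,\Big|\, \pi_2\right\}$
        $= \Phi(u) - \phi(u)\left\{\dfrac{1}{N_2}\left(\dfrac{u}{2} - \dfrac{p - 1}{\Delta}\right) + \dfrac{1}{n}\left[\dfrac{u^3}{4} + \left(p - \dfrac{3}{4}\right)u\right]\right\} + O(n^{-2}).$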

Usually, one is interested in u < 0 (small probabilities of error). Then the 
correction term is positive; that is, the normal approximation underestimates 
the probability of misclassification. 

One may want to choose the cutoff point $c$ so that one probability of
misclassification is controlled. Let $\alpha$ be the desired $\Pr\{W \le c | \pi_1\}$. Anderson
(1973b, 1973c) derived the following theorem:


Theorem 6.6.3. Let $u_0$ be such that $\Phi(u_0) = \alpha$, and let
p-1l 1 


(7) u=Uy — a | RR - FY 


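Then as $N_1 \to \infty$, $N_2 \to \infty$, and $N_1/N_2 \to$ a positive limit,

(8)    $\Pr\left\{\dfrac{W - \frac{1}{2}D^2}{D} \le u \,\Big|\, \pi_1\right\} = \alpha + O(n^{-2}).$


Then $c = Du + \tfrac{1}{2}D^2$ will attain the desired probability $\alpha$ to within $O(n^{-2})$.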

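We now turn to evaluating the probabilities of misclassification after the
two samples have been drawn. Conditional on $\bar{x}^{(1)}$, $\bar{x}^{(2)}$, and $S$, the random
variable $W$ is normally distributed with conditional mean

(9)    $\mathcal{E}(W | \pi_i, \bar{x}^{(1)}, \bar{x}^{(2)}, S) = [\mu^{(i)} - \tfrac{1}{2}(\bar{x}^{(1)} + \bar{x}^{(2)})]'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$
        $= \mu^{(i)}(\bar{x}^{(1)}, \bar{x}^{(2)}, S)$

when $x$ is from $\pi_i$, $i = 1, 2$, and conditional variance

(10)    $\mathrm{Var}(W | \bar{x}^{(1)}, \bar{x}^{(2)}, S) = (\bar{x}^{(1)} - \bar{x}^{(2)})'S^{-1}\Sigma S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$
        $= \sigma^2(\bar{x}^{(1)}, \bar{x}^{(2)}, S).$

Note that these means and variance are functions of the samples with
probability limits

(11)    $\underset{N_1, N_2 \to \infty}{\mathrm{plim}}\ \mu^{(i)}(\bar{x}^{(1)}, \bar{x}^{(2)}, S) = (-1)^{i-1}\tfrac{1}{2}\Delta^2,$
        $\underset{N_1, N_2 \to \infty}{\mathrm{plim}}\ \sigma^2(\bar{x}^{(1)}, \bar{x}^{(2)}, S) = \Delta^2.$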



For large $N_1$ and $N_2$ the conditional probabilities of misclassification are
close to the limiting normal probabilities (with high probability relative to
$\bar{x}^{(1)}$, $\bar{x}^{(2)}$, and $S$).

When $c$ is the cutoff point, probabilities of misclassification conditional on
$\bar{x}^{(1)}$, $\bar{x}^{(2)}$, and $S$ are

(12)    $P(2|1, c, \bar{x}^{(1)}, \bar{x}^{(2)}, S) = \Phi\left[\dfrac{c - \mu^{(1)}(\bar{x}^{(1)}, \bar{x}^{(2)}, S)}{\sigma(\bar{x}^{(1)}, \bar{x}^{(2)}, S)}\right],$

(13)    $P(1|2, c, \bar{x}^{(1)}, \bar{x}^{(2)}, S) = 1 - \Phi\left[\dfrac{c - \mu^{(2)}(\bar{x}^{(1)}, \bar{x}^{(2)}, S)}{\sigma(\bar{x}^{(1)}, \bar{x}^{(2)}, S)}\right].$

In (12) write $c$ as $Du_1 + \tfrac{1}{2}D^2$. Then the argument of $\Phi(\cdot)$ in (12) is
$u_1 D/\sigma + (\bar{x}^{(1)} - \bar{x}^{(2)})'S^{-1}(\bar{x}^{(1)} - \mu^{(1)})/\sigma$; the first term converges in probability
to $u_1$, the second term tends to 0 as $N_1 \to \infty$, $N_2 \to \infty$, and (12) to $\Phi(u_1)$.
In (13) write $c$ as $Du_2 - \tfrac{1}{2}D^2$. Then the argument of $\Phi(\cdot)$ in (13) is
$u_2 D/\sigma + (\bar{x}^{(1)} - \bar{x}^{(2)})'S^{-1}(\bar{x}^{(2)} - \mu^{(2)})/\sigma$. The first term converges in probability
to $u_2$ and the second term to 0; (13) converges to $1 - \Phi(u_2)$.

For given $\bar{x}^{(1)}$, $\bar{x}^{(2)}$, and $S$ the (conditional) probabilities of misclassification
(12) and (13) are functions of the parameters $\mu^{(1)}$, $\mu^{(2)}$, $\Sigma$ and can be
estimated. Consider them when $c = 0$. Then (12) and (13) converge in
probability to $\Phi(-\tfrac{1}{2}\Delta)$; that suggests $\Phi(-\tfrac{1}{2}D)$ as an estimator of (12) and
(13). A better estimator is $\Phi(-\tfrac{1}{2}\tilde{D})$, where $\tilde{D}^2 = (n - p - 1)D^2/n$, which
is closer to being an unbiased estimator of $\Delta^2$. [See (4).] McLachlan
(1973, 1974a, 1974b, 1974c) gave an estimator of (12) whose bias is of order
$n^{-2}$; it is
(14) o(-4D)+4(4 by Foy eee fe D> + (4-1). 

N,D 32n 
[McLachlan gave (14) to terms of order $n^{-1}$.] McLachlan explored the
properties of these and other estimators, as did Lachenbruch and Mickey
(1968).

Now consider (12) with $c = Du_1 + \tfrac{1}{2}D^2$; $u_1$ might be chosen to control
$P(2|1)$ conditional on $\bar{x}^{(1)}$, $\bar{x}^{(2)}$, $S$. This conditional probability as a function of
$\bar{x}^{(1)}$, $\bar{x}^{(2)}$, $S$ is a random variable whose distribution may be approximated.
McLachlan showed the following:


Theorem 6.6.4. As $N_1 \to \infty$, $N_2 \to \infty$, and $N_1/N_2 \to$ a positive limit,


2 pulh yu} _< 
1s mie ei a 
b(u,)[4u2 + n/N]! 
2G po ke DAMN i PON 4 


+ O(n). 


Va [Lud +n/N,]} 




McLachlan (1977) gave a method of selecting $u_1$ so that the probability of
one misclassification is less than a preassigned $\delta$ with a preassigned
confidence level $1 - \varepsilon$.


6.6.2. Asymptotic Expansions of the Probabilities of Misclassification 
Using Z 


We now turn our attention to $Z$ defined by (32) of Section 6.5. The results
are parallel to those for $W$. Memon and Okamoto (1971) expanded the
distribution of $Z$ to terms of order $n^{-2}$, and Siotani and Wang (1975), (1977)
to terms of order $n^{-3}$.


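Theorem 6.6.5. As $N_1 \to \infty$, $N_2 \to \infty$, and $N_1/N_2$ approaches a positive
limit,

(16)    $\Pr\left\{\dfrac{Z - \frac{1}{2}\Delta^2}{\Delta} \le u \,\Big|\, \pi_1\right\}$
        $= \Phi(u) - \phi(u)\left\{\dfrac{1}{2N_1\Delta^2}\left[u^3 + \Delta u^2 + (p - 3)u - \Delta\right]\right.$
        $+ \dfrac{1}{2N_2\Delta^2}\left[u^3 + \Delta u^2 + (p - 3 - \Delta^2)u - \Delta^3 - \Delta\right]$
        $\left. + \dfrac{1}{4n}\left[4u^3 + 4\Delta u^2 + (6p - 6 + \Delta^2)u + 2(p - 1)\Delta\right]\right\} + O(n^{-2}),$

and $\Pr\{-(Z + \tfrac{1}{2}\Delta^2)/\Delta \le u | \pi_2\}$ is (16) with $N_1$ and $N_2$ interchanged.

When $c = 0$, then $u = -\tfrac{1}{2}\Delta$. If $N_1 = N_2$, the rule with $Z$ is identical to the
rule with $W$, and the probability of misclassification is given by (2).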


Fujikoshi and Kanazawa (1976) proved 


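Theorem 6.6.6

(17)    $\Pr\left\{\dfrac{Z - \frac{1}{2}D^2}{D} \le u \,\Big|\, \pi_1\right\}$
        $= \Phi(u) - \phi(u)\left\{\dfrac{1}{2N_1\Delta}\left[u^2 + \Delta u - (p - 1)\right] - \dfrac{1}{2N_2\Delta}\left[u^2 + 2\Delta u + p - 1 + \Delta^2\right]\right.$
        $\left. + \dfrac{1}{4n}\left[u^3 + (4p - 3)u\right]\right\} + O(n^{-2}),$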




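(18)    $\Pr\left\{-\dfrac{Z + \frac{1}{2}D^2}{D} \le u \,\Big|\, \pi_2\right\}$
        $= \Phi(u) - \phi(u)\left\{-\dfrac{1}{2N_1\Delta}\left[u^2 + 2\Delta u + p - 1 + \Delta^2\right] + \dfrac{1}{2N_2\Delta}\left[u^2 + \Delta u - (p - 1)\right]\right.$
        $\left. + \dfrac{1}{4n}\left[u^3 + (4p - 3)u\right]\right\} + O(n^{-2}).$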


Kanazawa (1979) showed the following: 


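Theorem 6.6.7. Let $u_0$ be such that $\Phi(u_0) = \alpha$, and let

(19)    $u = u_0 + \dfrac{1}{2N_1 D}\left[u_0^2 + Du_0 - (p - 1)\right] - \dfrac{1}{2N_2 D}\left[u_0^2 + Du_0 + (p - 1) - D^2\right]$
        $+ \dfrac{1}{4n}\left[u_0^3 + (4p - 5)u_0\right].$

Then as $N_1 \to \infty$, $N_2 \to \infty$, and $N_1/N_2 \to$ a positive limit,

(20)    $\Pr\left\{\dfrac{Z - \frac{1}{2}D^2}{D} \le u \,\Big|\, \pi_1\right\} = \alpha + O(n^{-2}).$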


Now consider the probabilities of misclassification after the samples have 
been drawn. The conditional distribution of Z is not normal: Z is quadratic 
in x unless N, = N,. We do not have expressions equivalent to (12) and (13). 
Siotani (1980) showed the following: 


Theorem 6.6.8. As $N_1 \to \infty$, $N_2 \to \infty$, and $N_1/N_2 \to$ a positive limit,


N,N, P(2\1,0, ¥?, x, S)~ (—7A) 
(21) Pe(2 I N ( | 0 x x ) D( 3A Pr" 
2 


N+ 604) 


z NiNo 1 2 
afx-2 N_ +N, (rower - a] 


+O(n™*), 


+ Ton [4(e - 1) +387] tees 


It is also possible to obtain a similar expression for $P(2|1, Du_1 + \tfrac{1}{2}D^2, \bar{x}^{(1)}, \bar{x}^{(2)}, S)$
for $Z$ and a confidence interval. See Siotani (1980).


6.7. CLASSIFICATION INTO ONE OF SEVERAL POPULATIONS 


Let us now consider the problem of classifying an observation into one of
several populations. We shall extend the consideration of the previous
sections to the cases of more than two populations. Let $\pi_1, \ldots, \pi_m$ be $m$
populations with density functions $p_1(x), \ldots, p_m(x)$, respectively. We wish to
divide the space of observations into $m$ mutually exclusive and exhaustive
regions $R_1, \ldots, R_m$. If an observation falls into $R_i$, we shall say that it comes
from $\pi_i$. Let the cost of misclassifying an observation from $\pi_i$ as coming from
$\pi_j$ be $C(j|i)$. The probability of this misclassification is

(1)    $P(j|i, R) = \displaystyle\int_{R_j} p_i(x)\, dx.$

Suppose we have a priori probabilities of the populations, $q_1, \ldots, q_m$. Then
the expected loss is

(2)    $\displaystyle\sum_{i=1}^{m} q_i \left[\sum_{\substack{j=1 \\ j \ne i}}^{m} C(j|i) P(j|i, R)\right].$


We should like to choose $R_1, \ldots, R_m$ to make this a minimum.

Since we have a priori probabilities for the populations, we can define the
conditional probability of an observation coming from a population given
the values of the components of the vector $x$. The conditional probability of
the observation coming from $\pi_i$ is

(3)    $\dfrac{q_i p_i(x)}{\sum_{k=1}^{m} q_k p_k(x)}.$

If we classify the observation as from $\pi_j$, the expected loss is

(4)    $\displaystyle\sum_{\substack{i=1 \\ i \ne j}}^{m} \frac{q_i p_i(x)}{\sum_{k=1}^{m} q_k p_k(x)}\, C(j|i).$

We minimize the expected loss at this point if we choose $j$ so as to minimize
(4); that is, we consider

(5)    $\displaystyle\sum_{\substack{i=1 \\ i \ne j}}^{m} q_i p_i(x) C(j|i)$




for all $j$ and select that $j$ that gives the minimum. (If two different indices
give the minimum, it is irrelevant which index is selected.) This procedure
assigns the point $x$ to one of the $R_j$. Following this procedure for each $x$, we
define our regions $R_1, \ldots, R_m$. The classification procedure, then, is to
classify an observation as coming from $\pi_j$ if it falls in $R_j$.


Theorem 6.7.1. If $q_i$ is the a priori probability of drawing an observation
from population $\pi_i$ with density $p_i(x)$, $i = 1, \ldots, m$, and if the cost of misclassifying
an observation from $\pi_i$ as from $\pi_j$ is $C(j|i)$, then the regions of classification,
$R_1, \ldots, R_m$, that minimize the expected cost are defined by assigning $x$ to
$R_k$ if

(6)    $\displaystyle\sum_{\substack{i=1 \\ i \ne k}}^{m} q_i p_i(x) C(k|i) < \sum_{\substack{i=1 \\ i \ne j}}^{m} q_i p_i(x) C(j|i), \qquad j = 1, \ldots, m, \quad j \ne k.$

[If (6) holds for all $j$ ($j \ne k$) except for $h$ indices and the inequality is replaced by
equality for those indices, then this point can be assigned to any of the $h + 1$ $\pi$'s.]
If the probability of equality between the right-hand and left-hand sides of (6) is
zero for each $k$ and $j$ under $\pi_i$ (each $i$), then the minimizing procedure is unique
except for sets of probability zero.


Proof. We now verify this result. Let

(7)    $h_j(x) = \displaystyle\sum_{\substack{i=1 \\ i \ne j}}^{m} q_i p_i(x) C(j|i).$

Then the expected loss of a procedure $R$ is

(8)    $\displaystyle\sum_{j=1}^{m} \int_{R_j} h_j(x)\, dx = \int h(x|R)\, dx,$

where $h(x|R) = h_j(x)$ for $x$ in $R_j$. For the Bayes procedure $R^*$ described in
the theorem, $h(x|R)$ is $h(x|R^*) = \min_i h_i(x)$. Thus the difference between
the expected loss for any procedure $R$ and for $R^*$ is

(9)    $\displaystyle\int [h(x|R) - h(x|R^*)]\, dx = \sum_{j} \int_{R_j} [h_j(x) - \min_i h_i(x)]\, dx \ge 0.$

Equality can hold only if $h_j(x) = \min_i h_i(x)$ for $x$ in $R_j$ except for sets of
probability zero. ∎
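The rule of Theorem 6.7.1 amounts to computing the $m$ quantities $h_j(x)$ of (7) and picking the smallest. The sketch below assumes univariate normal densities with made-up means purely for illustration; the function names and the convention `cost[j][i]` for $C(j|i)$ are hypothetical, and populations are indexed from 0.

```python
import math

def normal_pdf(x, mean, sd):
    """Univariate normal density, used here only as an example of p_i(x)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def bayes_assign(x, priors, densities, cost):
    """Return the index j minimizing h_j(x) = sum_{i != j} q_i p_i(x) C(j|i).

    cost[j][i] holds C(j|i), the cost of assigning to j an observation from i.
    Ties are broken in favor of the smallest index.
    """
    m = len(priors)
    weights = [priors[i] * densities[i](x) for i in range(m)]
    h = [sum(weights[i] * cost[j][i] for i in range(m) if i != j) for j in range(m)]
    return min(range(m), key=lambda j: h[j])

# Three assumed populations N(-2, 1), N(0, 1), N(2, 1) with equal priors.
densities = [lambda x, mu=mu: normal_pdf(x, mu, 1.0) for mu in (-2.0, 0.0, 2.0)]
priors = [1.0 / 3.0] * 3
unit_cost = [[0.0 if i == j else 1.0 for i in range(3)] for j in range(3)]
print([bayes_assign(x, priors, densities, unit_cost) for x in (-1.5, 0.3, 3.0)])

# Raising C(1|2), the cost of calling a population-2 observation population 1,
# shifts the boundary between regions 1 and 2 toward population 1.
skewed = [row[:] for row in unit_cost]
skewed[1][2] = 10.0
print(bayes_assign(0.9, priors, densities, skewed))
```

With unit costs the rule picks the population with the largest $q_i p_i(x)$; with the skewed cost matrix the point 0.9, which unit costs assign to index 1, moves to index 2.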




Let us see how this method applies when $C(j|i) = 1$ for all $i$ and $j$, $i \ne j$.
Then in $R_k$

(10)    $\displaystyle\sum_{\substack{i=1 \\ i \ne k}}^{m} q_i p_i(x) < \sum_{\substack{i=1 \\ i \ne j}}^{m} q_i p_i(x), \qquad j \ne k.$

Subtracting $\sum_{i=1,\, i \ne k, j}^{m} q_i p_i(x)$ from both sides of (10), we obtain

(11)    $q_j p_j(x) < q_k p_k(x), \qquad j \ne k.$

In this case the point $x$ is in $R_k$ if $k$ is the index for which $q_i p_i(x)$ is a
maximum; that is, $\pi_k$ is the most probable population.

Now suppose that we do not have a priori probabilities. Then we cannot
define an unconditional expected loss for a classification procedure. However,
we can define an expected loss on the condition that the observation
comes from a given population. The conditional expected loss if the observation
is from $\pi_i$ is

(12)    $\displaystyle\sum_{\substack{j=1 \\ j \ne i}}^{m} C(j|i) P(j|i, R) = r(i, R).$

A procedure $R$ is at least as good as $R^*$ if $r(i, R) \le r(i, R^*)$, $i = 1, \ldots, m$; $R$
is better if at least one inequality is strict. $R$ is admissible if there is no
procedure $R^*$ that is better. A class of procedures is complete if for every
procedure $R$ outside the class there is a procedure $R^*$ in the class that is
better.

Now let us show that a Bayes procedure is admissible. Let $R$ be a Bayes
procedure; let $R^*$ be another procedure. Since $R$ is Bayes,

(13)    $\displaystyle\sum_{i=1}^{m} q_i r(i, R) \le \sum_{i=1}^{m} q_i r(i, R^*).$

Suppose $q_1 > 0$, $q_2 > 0$, $r(2, R^*) < r(2, R)$, and $r(i, R^*) \le r(i, R)$, $i = 3, \ldots, m$.
Then

(14)    $q_1[r(1, R) - r(1, R^*)] \le \displaystyle\sum_{i=2}^{m} q_i[r(i, R^*) - r(i, R)] < 0$

and $r(1, R) < r(1, R^*)$. Thus $R^*$ is not better than $R$.


Theorem 6.7.2. If $q_i > 0$, $i = 1, \ldots, m$, then a Bayes procedure is admissible.




We shall now assume that $C(i|j) = 1$, $i \ne j$, and $\Pr\{p_i(x) = 0 | \pi_j\} = 0$. The
latter condition implies that all $p_j(x)$ are positive on the same set (except for
a set of measure 0). Suppose $q_i = 0$ for $i = 1, \ldots, t$, and $q_i > 0$ for $i = t + 1, \ldots, m$.
Then for the Bayes solution $R_i$, $i = 1, \ldots, t$, is empty (except for
a set of probability 0), as seen from (11) [that is, $p_m(x) = 0$ for $x$ in $R_i$].
It follows that $r(i, R) = \sum_{j \ne i} P(j|i, R) = 1 - P(i|i, R) = 1$ for $i = 1, \ldots, t$.
Then $(R_{t+1}, \ldots, R_m)$ is a Bayes solution for the problem involving
$p_{t+1}(x), \ldots, p_m(x)$ and $q_{t+1}, \ldots, q_m$. It follows from Theorem 6.7.2 that no
procedure $R^*$ for which $P(i|i, R^*) = 0$, $i = 1, \ldots, t$, can be better than the
Bayes procedure. Now consider a procedure $R^*$ such that $R_1^*$ includes a set
of positive probability so that $P(1|1, R^*) > 0$. For $R^*$ to be better than $R$,

(15)    $P(i|i, R) = \displaystyle\int_{R_i} p_i(x)\, dx \le P(i|i, R^*) = \int_{R_i^*} p_i(x)\, dx, \qquad i = 2, \ldots, m.$

In such a case a procedure $R^{**}$ where $R_i^{**}$ is empty, $i = 1, \ldots, t$, $R_i^{**} = R_i^*$,
$i = t + 1, \ldots, m - 1$, and $R_m^{**} = R_m^* \cup R_1^* \cup \cdots \cup R_t^*$ would give risks such
that

(16)    $P(i|i, R^{**}) = 0, \qquad i = 1, \ldots, t,$
        $P(i|i, R^{**}) = P(i|i, R^*) \ge P(i|i, R), \qquad i = t + 1, \ldots, m - 1,$
        $P(m|m, R^{**}) > P(m|m, R^*) \ge P(m|m, R).$

Then $(R_{t+1}^{**}, \ldots, R_m^{**})$ would be better than $(R_{t+1}, \ldots, R_m)$ for the $(m - t)$-decision
problem, which contradicts the preceding discussion.


Theorem 6.7.3. If $C(i|j) = 1$, $i \ne j$, and $\Pr\{p_i(x) = 0 | \pi_j\} = 0$, then a Bayes
procedure is admissible.


The converse is true without conditions (except that the parameter space 
is finite). 


Theorem 6.7.4. Every admissible procedure is a Bayes procedure. 


We shall not prove this theorem. It is Theorem 1 of Section 2.10 of
Ferguson (1967), for example. The class of Bayes procedures is minimal 
complete if each Bayes procedure is unique (for the specified probabilities). 

The minimax procedure is the Bayes procedure for which the risks are 
equal. 




There are available general treatments of statistical decision procedures 
by Wald (1950), Blackwell and Girshick (1954), Ferguson (1967), De Groot 
(1970), Berger (1980b), and others. 


6.8. CLASSIFICATION INTO ONE OF SEVERAL MULTIVARIATE 
NORMAL POPULATIONS 


We shall now apply the theory of Section 6.7 to the case in which each
population has a normal distribution. [See von Mises (1945).] We assume that
the means are different and the covariance matrices are alike. Let $N(\mu^{(i)}, \Sigma)$
be the distribution of $\pi_i$. The density is given by (1) of Section 6.4. At the
outset the parameters are assumed known. For general costs with known
a priori probabilities we can form the $m$ functions (5) of Section 6.7 and
define the region $R_j$ as consisting of points $x$ such that the $j$th function is
minimum.

In the remainder of our discussion we shall assume that the costs of
misclassification are equal. Then we use the functions

(1)  $$u_{jk}(x) = \log\frac{p_j(x)}{p_k(x)} = \left[x - \tfrac12\left(\mu^{(j)} + \mu^{(k)}\right)\right]'\Sigma^{-1}\left(\mu^{(j)} - \mu^{(k)}\right).$$

If a priori probabilities are known, the region $R_j$ is defined by those $x$
satisfying

(2)  $$R_j:\quad u_{jk}(x) > \log\frac{q_k}{q_j}, \qquad k = 1, \ldots, m,\ k \ne j.$$

Theorem 6.8.1. If $q_i$ is the a priori probability of drawing an observation
from $\pi_i = N(\mu^{(i)}, \Sigma)$, $i = 1, \ldots, m$, and if the costs of misclassification are equal,
then the regions of classification, $R_1, \ldots, R_m$, that minimize the expected cost are
defined by (2), where $u_{jk}(x)$ is given by (1).
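As an illustration of Theorem 6.8.1, the following is a minimal sketch (not taken from the text) of the rule defined by (1) and (2), assuming the parameters are known. The function name `classify_equal_cov` and the 0-based population index are choices made here for illustration only.

```python
import numpy as np

def classify_equal_cov(x, means, sigma, priors):
    """Return the 0-based index j of the region R_j that contains x.

    means  : sequence of the m mean vectors mu^(1), ..., mu^(m)
    sigma  : common covariance matrix
    priors : a priori probabilities q_1, ..., q_m
    """
    x = np.asarray(x, dtype=float)
    means = [np.asarray(mu, dtype=float) for mu in means]
    sigma_inv = np.linalg.inv(np.asarray(sigma, dtype=float))
    m = len(means)

    def u(j, k):
        # u_jk(x) = [x - (mu_j + mu_k)/2]' Sigma^{-1} (mu_j - mu_k), as in (1)
        return (x - 0.5 * (means[j] + means[k])) @ sigma_inv @ (means[j] - means[k])

    for j in range(m):
        # x is in R_j when u_jk(x) > log(q_k / q_j) for every k != j, as in (2)
        if all(u(j, k) > np.log(priors[k] / priors[j]) for k in range(m) if k != j):
            return j
    return None  # x lies exactly on a boundary (a set of probability zero)

# Hypothetical example: three populations in the plane with identity covariance.
example_means = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
example_label = classify_equal_cov([1.9, 0.2], example_means, np.eye(2), [1/3, 1/3, 1/3])
```

Because every $u_{jk}$ is linear in $x$, each region returned by this rule is an intersection of half-spaces.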


It should be noted that each $u_{jk}(x)$ is the classification function related to
the $j$th and $k$th populations, and $u_{jk}(x) = -u_{kj}(x)$. Since these are linear
functions, the region $R_j$ is bounded by hyperplanes. If the means span an
$(m - 1)$-dimensional hyperplane (for example, if the vectors $\mu^{(i)}$ are linearly
independent and $p \ge m - 1$), then $R_j$ is bounded by $m - 1$ hyperplanes.

In the case of no set of a priori probabilities known, the region $R_j$ is
defined by inequalities

(3)  $$u_{jk}(x) \ge c_k - c_j, \qquad k = 1, \ldots, m,\ k \ne j.$$




The constants $c_k$ can be taken nonnegative. These sets of regions form the
class of admissible procedures. For the minimax procedure these constants
are determined so all $P(i \mid i, R)$ are equal.

We now show how to evaluate the probabilities of correct classification. If
$X$ is a random observation, we consider the random variables

(4)  $$U_{ji} = \left[X - \tfrac12\left(\mu^{(j)} + \mu^{(i)}\right)\right]'\Sigma^{-1}\left(\mu^{(j)} - \mu^{(i)}\right).$$

Here $U_{ji} = -U_{ij}$. Thus we use $m(m-1)/2$ classification functions if the
means span an $(m - 1)$-dimensional hyperplane. If $X$ is from $\pi_j$, then $U_{ji}$ is
distributed according to $N(\tfrac12\Delta_{ji}^2, \Delta_{ji}^2)$, where

(5)  $$\Delta_{ji}^2 = \left(\mu^{(j)} - \mu^{(i)}\right)'\Sigma^{-1}\left(\mu^{(j)} - \mu^{(i)}\right).$$

The covariance of $U_{ji}$ and $U_{jk}$ is

(6)  $$\Delta_{ji,jk} = \left(\mu^{(j)} - \mu^{(i)}\right)'\Sigma^{-1}\left(\mu^{(j)} - \mu^{(k)}\right).$$


To determine the constants $c_j$ we consider the integrals

(7)  $$P(j \mid j, R) = \int_{c_m - c_j}^{\infty} \cdots \int_{c_1 - c_j}^{\infty} f_j \, du_{j1} \cdots du_{j,j-1}\, du_{j,j+1} \cdots du_{jm},$$

where $f_j$ is the density of $U_{ji}$, $i = 1, 2, \ldots, m$, $i \ne j$.

Theorem 6.8.2. If $\pi_i$ is $N(\mu^{(i)}, \Sigma)$ and the costs of misclassification are
equal, then the regions of classification, $R_1, \ldots, R_m$, that minimize the maximum
conditional expected loss are defined by (3), where $u_{jk}(x)$ is given by (1). The
constants $c_j$ are determined so that the integrals (7) are equal.


As an example consider the case of $m = 3$. There is no loss of generality in
taking $p = 2$, for the density for higher $p$ can be projected on the two-dimensional
plane determined by the means of the three populations if they are not
collinear (i.e., we can transform the vector $x$ into $u_{12}$, $u_{13}$, and $p - 2$ other
coordinates, where these last $p - 2$ components are distributed independently
of $u_{12}$ and $u_{13}$ and with zero means). The regions $R_i$ are determined
by three half lines as shown in Figure 6.2. If this procedure is minimax, we
cannot move the line between $R_1$ and $R_2$ nearer $(\mu_1^{(1)}, \mu_2^{(1)})$, the line between
$R_2$ and $R_3$ nearer $(\mu_1^{(2)}, \mu_2^{(2)})$, and the line between $R_3$ and $R_1$ nearer
$(\mu_1^{(3)}, \mu_2^{(3)})$ and still retain the equality $P(1 \mid 1, R) = P(2 \mid 2, R) = P(3 \mid 3, R)$
without leaving a triangle that is not included in any region. Thus, since the
regions must exhaust the space, the lines must meet in a point, and the
equality of probabilities determines $c_i - c_j$ uniquely.


Figure 6.2. Classification regions.


To do this in a specific case in which we have numerical values for the
components of the vectors $\mu^{(1)}$, $\mu^{(2)}$, $\mu^{(3)}$, and the matrix $\Sigma$, we would
consider the three ($\le p + 1$) joint distributions, each of two $U_{ij}$'s ($j \ne i$). We
could try the values of $c_i = 0$ and, using tables [Pearson (1931)] of the
bivariate normal distribution, compute $P(i \mid i, R)$. By a trial-and-error method
we could obtain $c_i$ to approximate the above condition.

The preceding theory has been given on the assumption that the parame- 
ters are known. If they are not known and if a sample from each population 
is available, the estimators of the parameters can be substituted in the 
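definition of $u_{ij}(x)$. Let the observations be $x_1^{(i)}, \ldots, x_{N_i}^{(i)}$ from $N(\mu^{(i)}, \Sigma)$,
$i = 1, \ldots, m$. We estimate $\mu^{(i)}$ by

(8)  $$\bar{x}^{(i)} = \frac{1}{N_i}\sum_{\alpha=1}^{N_i} x_\alpha^{(i)}$$

and $\Sigma$ by $S$ defined by

(9)  $$\left(\sum_{i=1}^{m} N_i - m\right) S = \sum_{i=1}^{m}\sum_{\alpha=1}^{N_i}\left(x_\alpha^{(i)} - \bar{x}^{(i)}\right)\left(x_\alpha^{(i)} - \bar{x}^{(i)}\right)'.$$

Then, the analog of $u_{ij}(x)$ is

(10)  $$w_{ij}(x) = \left[x - \tfrac12\left(\bar{x}^{(i)} + \bar{x}^{(j)}\right)\right]' S^{-1}\left(\bar{x}^{(i)} - \bar{x}^{(j)}\right).$$

If the variables above are random, the distributions are different from those
of $U_{ij}$. However, as $N_i \to \infty$, the joint distributions approach those of $U_{ij}$.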
Hence, for sufficiently large samples one can use the theory given above. 
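The plug-in construction (8)-(10) can be sketched as follows. This is an illustrative sketch only; the helper names `fit_plugin` and `w` and the toy data are assumptions introduced here, not part of the text.

```python
import numpy as np

def fit_plugin(samples):
    """samples: list of (N_i x p) arrays, one array per population."""
    xbars = [s.mean(axis=0) for s in samples]                       # (8)
    pooled = sum((s - xb).T @ (s - xb) for s, xb in zip(samples, xbars))
    dof = sum(len(s) for s in samples) - len(samples)               # sum N_i - m
    return xbars, pooled / dof                                      # S of (9)

def w(x, i, j, xbars, S):
    """Sample classification function w_ij(x) of (10), 0-based i and j."""
    diff = xbars[i] - xbars[j]
    return (x - 0.5 * (xbars[i] + xbars[j])) @ np.linalg.solve(S, diff)

# Hypothetical toy data: two populations of four points each in the plane.
square = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
toy_samples = [square, square + 5.0]
toy_xbars, toy_S = fit_plugin(toy_samples)
toy_value = w(np.array([1.0, 1.0]), 0, 1, toy_xbars, toy_S)
```

Solving the linear system with `np.linalg.solve` avoids forming $S^{-1}$ explicitly, which is a numerical convenience rather than anything required by (10).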


Table 6.2

                                             Mean
Measurement               Brahmin ($\pi_1$)   Artisan ($\pi_2$)   Korwa ($\pi_3$)
Stature ($x_1$)                164.51              160.53             158.17
Sitting height ($x_2$)          86.43               81.47              81.16
Nasal depth ($x_3$)             25.49               23.84              21.44
Nasal height ($x_4$)            51.24               48.62              46.72


6.9. AN EXAMPLE OF CLASSIFICATION INTO ONE OF SEVERAL 
MULTIVARIATE NORMAL POPULATIONS 


Rao (1948a) considers three populations consisting of the Brahmin caste
($\pi_1$), the Artisan caste ($\pi_2$), and the Korwa caste ($\pi_3$) of India. The
measurements for each individual of a caste are stature ($x_1$), sitting height
($x_2$), nasal depth ($x_3$), and nasal height ($x_4$). The means of these variables in
the three populations are given in Table 6.2. The matrix of correlations for
all the populations is

(1)  $$\begin{pmatrix} 1.0000 & 0.5849 & 0.1774 & 0.1974 \\ 0.5849 & 1.0000 & 0.2094 & 0.2170 \\ 0.1774 & 0.2094 & 1.0000 & 0.2910 \\ 0.1974 & 0.2170 & 0.2910 & 1.0000 \end{pmatrix}.$$

The standard deviations are $\sigma_1 = 5.74$, $\sigma_2 = 3.20$, $\sigma_3 = 1.75$, $\sigma_4 = 3.50$. We
assume that each population is normal. Our problem is to divide the space of
the four variables $x_1, x_2, x_3, x_4$ into three regions of classification. We
assume that the costs of misclassification are equal. We shall find (i) a set of
regions under the assumption that drawing a new observation from each
population is equally likely ($q_1 = q_2 = q_3 = \tfrac13$), and (ii) a set of regions such
that the largest probability of misclassification is minimized (the minimax
solution).


We first compute the coefficients of $\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$ and $\Sigma^{-1}(\mu^{(1)} - \mu^{(3)})$.
Then $\Sigma^{-1}(\mu^{(2)} - \mu^{(3)}) = \Sigma^{-1}(\mu^{(1)} - \mu^{(3)}) - \Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$. Then we calculate
$\tfrac12(\mu^{(i)} + \mu^{(j)})'\Sigma^{-1}(\mu^{(i)} - \mu^{(j)})$. We obtain the discriminant functions†

(2)  $$\begin{aligned} u_{12}(x) &= -0.0708x_1 + 0.4990x_2 + 0.3373x_3 + 0.0887x_4 - 43.13, \\ u_{13}(x) &= \phantom{-}0.0003x_1 + 0.3550x_2 + 1.1063x_3 + 0.1375x_4 - 62.49, \\ u_{23}(x) &= \phantom{-}0.0711x_1 - 0.1440x_2 + 0.7690x_3 + 0.0488x_4 - 19.36. \end{aligned}$$

†Due to an error in computations, Rao's discriminant functions are incorrect. I am indebted to
Mr. Peter Frank for assistance in the computations.


Table 6.3

Population of x    u           Mean     Standard deviation    Correlation
$\pi_1$            $u_{12}$    1.491    1.727                  0.8658
                   $u_{13}$    3.487    2.641
$\pi_2$            $u_{21}$    1.491    1.727                 -0.3894
                   $u_{23}$    1.031    1.436
$\pi_3$            $u_{31}$    3.487    2.641                  0.7983
                   $u_{32}$    1.031    1.436

The other three functions are $u_{21}(x) = -u_{12}(x)$, $u_{31}(x) = -u_{13}(x)$, and
$u_{32}(x) = -u_{23}(x)$. If there are a priori probabilities and they are equal, the
best set of regions of classification are $R_1$: $u_{12}(x) \ge 0$, $u_{13}(x) \ge 0$; $R_2$:
$u_{21}(x) \ge 0$, $u_{23}(x) \ge 0$; and $R_3$: $u_{31}(x) \ge 0$, $u_{32}(x) \ge 0$. For example, if we
obtain an individual with measurements $x$ such that $u_{12}(x) \ge 0$ and $u_{13}(x) \ge 0$,
we classify him as a Brahmin.
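The equal-prior rule just described can be written out directly from the three discriminant functions of (2). The sketch below is only an illustration: the coefficients are copied from (2), while the names `COEF`, `u`, and `classify` are introduced here and are not part of the text.

```python
import numpy as np

# Coefficient vectors and constants of u_12, u_13, u_23, copied from (2).
COEF = {
    (1, 2): (np.array([-0.0708, 0.4990, 0.3373, 0.0887]), -43.13),
    (1, 3): (np.array([0.0003, 0.3550, 1.1063, 0.1375]), -62.49),
    (2, 3): (np.array([0.0711, -0.1440, 0.7690, 0.0488]), -19.36),
}

def u(j, k, x):
    """u_jk(x); the remaining three functions follow from u_kj = -u_jk."""
    if (j, k) in COEF:
        a, const = COEF[(j, k)]
        return float(a @ x + const)
    return -u(k, j, x)

def classify(x):
    """Return 1 (Brahmin), 2 (Artisan), or 3 (Korwa) for equal a priori probabilities."""
    x = np.asarray(x, dtype=float)
    for j in (1, 2, 3):
        if all(u(j, k, x) >= 0 for k in (1, 2, 3) if k != j):
            return j
    return None

# The population means of Table 6.2 should fall in their own regions.
brahmin_mean = [164.51, 86.43, 25.49, 51.24]
label_for_brahmin_mean = classify(brahmin_mean)
```

Evaluating $u_{12}$ and $u_{13}$ at the Brahmin mean reproduces, up to rounding, the means 1.491 and 3.487 listed for $\pi_1$ in Table 6.3, which gives a quick consistency check on the coefficients.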

To find the probabilities of misclassification when an individual is drawn
from population $\pi_g$ we need the means, variances, and covariances of the
proper pairs of $u$'s. They are given in Table 6.3.

The probabilities of misclassification are then obtained by use of the
tables for the bivariate normal distribution. These probabilities are 0.21 for
$\pi_1$, 0.42 for $\pi_2$, and 0.25 for $\pi_3$. For example, if measurements are made on
a Brahmin, the probability that he is classified as an Artisan or Korwa is 0.21.

The minimax solution is obtained by finding the constants $c_1$, $c_2$, and $c_3$
for (3) of Section 6.8 so that the probabilities of misclassification are equal.
The regions of classification are

(3)  $$\begin{aligned} R_1&:\ u_{12}(x) \ge 0.54, & u_{13}(x) &\ge 0.29; \\ R_2&:\ u_{21}(x) \ge -0.54, & u_{23}(x) &\ge -0.25; \\ R_3&:\ u_{31}(x) \ge -0.29, & u_{32}(x) &\ge 0.25. \end{aligned}$$


The common probability of misclassification (to two decimal places) is 0.30. 
Thus the maximum probability of misclassification has been reduced from 
0.42 to 0.30. 
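The quoted error probabilities can be checked approximately by simulation instead of bivariate normal tables. The sketch below rests on two assumptions made here rather than in the text: the covariance of $U_{ja}$ and $U_{jb}$ is recovered from the tabulated means through the identity $\Delta_{ja,jb} = \tfrac12(\Delta_{ja}^2 + \Delta_{jb}^2 - \Delta_{ab}^2)$, which follows from (5) and (6) of Section 6.8, and the probabilities are estimated by Monte Carlo, so they agree with the text only up to sampling and rounding error. The names `half_d2`, `d2`, and `error_rate` are hypothetical.

```python
import numpy as np

# Means of u_12, u_13, u_23 under the first-named population (Table 6.3);
# each mean equals half of the corresponding variance Delta^2.
half_d2 = {(1, 2): 1.491, (1, 3): 3.487, (2, 3): 1.031}

def d2(i, j):
    """Squared distance Delta_ij^2 between populations i and j."""
    return 2.0 * half_d2[(min(i, j), max(i, j))]

def error_rate(j, cutoffs, n_draws=400_000, seed=0):
    """Monte Carlo estimate of 1 - P(j|j, R) when R_j is {u_jk >= cutoffs[k], k != j}."""
    a, b = [k for k in (1, 2, 3) if k != j]
    mean = np.array([0.5 * d2(j, a), 0.5 * d2(j, b)])
    cov_ab = 0.5 * (d2(j, a) + d2(j, b) - d2(a, b))   # assumed identity, see lead-in
    cov = np.array([[d2(j, a), cov_ab], [cov_ab, d2(j, b)]])
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(mean, cov, size=n_draws)
    correct = (draws[:, 0] >= cutoffs[a]) & (draws[:, 1] >= cutoffs[b])
    return 1.0 - correct.mean()

# Equal a priori probabilities: all cutoffs are zero.
equal_prior = {j: error_rate(j, {1: 0.0, 2: 0.0, 3: 0.0}) for j in (1, 2, 3)}

# Minimax regions (3): cutoffs taken from the text.
minimax = {
    1: error_rate(1, {2: 0.54, 3: 0.29}),
    2: error_rate(2, {1: -0.54, 3: -0.25}),
    3: error_rate(3, {1: -0.29, 2: 0.25}),
}
```

The same identity also yields the correlations of Table 6.3 from its means and standard deviations, up to rounding in the last digit.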


†Some numerical errors in Anderson (1951a) are corrected in Table 6.3 and (3).




6.10. CLASSIFICATION INTO ONE OF TWO KNOWN MULTIVARIATE 
NORMAL POPULATIONS WITH UNEQUAL COVARIANCE MATRICES 


6.10.1. Likelihood Procedures 


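Let $\pi_1$ and $\pi_2$ be $N(\mu^{(1)}, \Sigma_1)$ and $N(\mu^{(2)}, \Sigma_2)$ with $\mu^{(1)} \ne \mu^{(2)}$ and $\Sigma_1 \ne \Sigma_2$.
When the parameters are known, the likelihood ratio is

(1)  $$\begin{aligned} \frac{p_1(x)}{p_2(x)} &= \frac{|\Sigma_2|^{1/2}\exp\left[-\tfrac12(x - \mu^{(1)})'\Sigma_1^{-1}(x - \mu^{(1)})\right]}{|\Sigma_1|^{1/2}\exp\left[-\tfrac12(x - \mu^{(2)})'\Sigma_2^{-1}(x - \mu^{(2)})\right]} \\ &= |\Sigma_2|^{1/2}|\Sigma_1|^{-1/2}\exp\left[\tfrac12(x - \mu^{(2)})'\Sigma_2^{-1}(x - \mu^{(2)}) - \tfrac12(x - \mu^{(1)})'\Sigma_1^{-1}(x - \mu^{(1)})\right]. \end{aligned}$$

The logarithm of (1) is quadratic in $x$. The probabilities of misclassification
are difficult to compute. [One can make a linear transformation of $x$ so that
its covariance matrix is $I$ and the matrix of the quadratic form is diagonal;
then the logarithm of (1) has the distribution of a linear combination of
noncentral $\chi^2$-variables plus a constant.]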

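When the parameters are unknown, we consider the problem as testing
the hypothesis that $x, x_1^{(1)}, \ldots, x_{N_1}^{(1)}$ are observations from $N(\mu^{(1)}, \Sigma_1)$ and
$x_1^{(2)}, \ldots, x_{N_2}^{(2)}$ are observations from $N(\mu^{(2)}, \Sigma_2)$ against the alternative that
$x_1^{(1)}, \ldots, x_{N_1}^{(1)}$ are observations from $N(\mu^{(1)}, \Sigma_1)$ and $x, x_1^{(2)}, \ldots, x_{N_2}^{(2)}$ are
observations from $N(\mu^{(2)}, \Sigma_2)$. Under the first hypothesis the maximum likelihood
estimators are $\hat\mu_1^{(1)} = (N_1\bar{x}^{(1)} + x)/(N_1 + 1)$, $\hat\mu_1^{(2)} = \bar{x}^{(2)}$,

(2)  $$\hat\Sigma_1(1) = \frac{1}{N_1 + 1}\left[A_1 + \frac{N_1}{N_1 + 1}\left(x - \bar{x}^{(1)}\right)\left(x - \bar{x}^{(1)}\right)'\right], \qquad \hat\Sigma_2(1) = \frac{1}{N_2}A_2,$$

where $A_i = \sum_{\alpha=1}^{N_i}(x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(i)} - \bar{x}^{(i)})'$, $i = 1, 2$. (See Section 6.5.5.) Under
the second hypothesis the maximum likelihood estimators are $\hat\mu_2^{(1)} = \bar{x}^{(1)}$,
$\hat\mu_2^{(2)} = (N_2\bar{x}^{(2)} + x)/(N_2 + 1)$,

(3)  $$\hat\Sigma_1(2) = \frac{1}{N_1}A_1, \qquad \hat\Sigma_2(2) = \frac{1}{N_2 + 1}\left[A_2 + \frac{N_2}{N_2 + 1}\left(x - \bar{x}^{(2)}\right)\left(x - \bar{x}^{(2)}\right)'\right].$$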

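The likelihood ratio criterion is

(4)  $$\begin{aligned} &\frac{|\hat\Sigma_1(2)|^{\frac12 N_1}\,|\hat\Sigma_2(2)|^{\frac12(N_2+1)}}{|\hat\Sigma_1(1)|^{\frac12(N_1+1)}\,|\hat\Sigma_2(1)|^{\frac12 N_2}} \\ &\qquad = \frac{\left[1 + \frac{N_2}{N_2+1}(x - \bar{x}^{(2)})'A_2^{-1}(x - \bar{x}^{(2)})\right]^{\frac12(N_2+1)}}{\left[1 + \frac{N_1}{N_1+1}(x - \bar{x}^{(1)})'A_1^{-1}(x - \bar{x}^{(1)})\right]^{\frac12(N_1+1)}} \cdot \frac{(N_1+1)^{\frac12(N_1+1)p}\,N_2^{\frac12 N_2 p}\,|A_2|^{\frac12}}{N_1^{\frac12 N_1 p}\,(N_2+1)^{\frac12(N_2+1)p}\,|A_1|^{\frac12}}. \end{aligned}$$

The observation $x$ is classified into $\pi_1$ if (4) is greater than 1 and into $\pi_2$ if
(4) is less than 1.

An alternative criterion is to plug estimates into the logarithm of (1). Use

(5)  $$\tfrac12\left[\log\frac{|S_2|}{|S_1|} + \left(x - \bar{x}^{(2)}\right)'S_2^{-1}\left(x - \bar{x}^{(2)}\right) - \left(x - \bar{x}^{(1)}\right)'S_1^{-1}\left(x - \bar{x}^{(1)}\right)\right]$$

to classify into $\pi_1$ if (5) is large and into $\pi_2$ if (5) is small. Again it is difficult
to evaluate the probabilities of misclassification.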
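A minimal sketch of the plug-in quadratic criterion, assuming it is the logarithm of (1) with $\mu^{(i)}$ and $\Sigma_i$ replaced by sample means $\bar{x}^{(i)}$ and sample covariance matrices $S_i$. The function name `quadratic_score` and the numerical inputs are hypothetical; the text does not fix a cutoff, so the comparison with 0 in the usage note corresponds to the additional assumption of equal costs and equal a priori probabilities.

```python
import numpy as np

def quadratic_score(x, xbar1, S1, xbar2, S2):
    """Plug-in estimate of log p_1(x)/p_2(x); large values favour population 1."""
    d1, d2 = x - xbar1, x - xbar2
    _, logdet1 = np.linalg.slogdet(S1)   # log |S_1|
    _, logdet2 = np.linalg.slogdet(S2)   # log |S_2|
    return 0.5 * (logdet2 - logdet1
                  + d2 @ np.linalg.solve(S2, d2)
                  - d1 @ np.linalg.solve(S1, d1))

# Hypothetical estimates: a tight population at the origin, a diffuse one at (3, 0).
xbar1, S1 = np.zeros(2), np.eye(2)
xbar2, S2 = np.array([3.0, 0.0]), 4.0 * np.eye(2)
score_at_first_mean = quadratic_score(xbar1, xbar1, S1, xbar2, S2)
score_at_second_mean = quadratic_score(xbar2, xbar1, S1, xbar2, S2)
```

Here a positive score would assign the observation to the first population and a negative score to the second.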


6.10.2. Linear Procedures 


The best procedures when $\Sigma_1 \ne \Sigma_2$ are not linear; when the parameters are
known, the best procedures are based on a quadratic function of the vector 
observation x. The procedure depends very much on the assumed normality. 
For example, in the case of p=1, the region for classification with one 
population is an interval and for the other is the complement of the interval 
—that is, where the observation is sufficiently large or sufficiently small. In 
the bivariate case the regions are defined by conic sections; for example, the
region of classification into one population might be the interior of an ellipse 
or the region between two hyperbolas. In general, the regions are defined by 
means of a quadratic function of the observations which is not necessarily a 
positive definite quadratic form. These procedures depend very much on the
assumption of normality and especially on the shape of the normal distribution
far from its center. For instance, in the univariate case cited above the
region of classification into the first population is a finite interval because the 
density of the first population falls off in either direction more rapidly than 
the density of the second because its standard deviation is smaller. 

One may want to use a classification procedure in a situation where the 
two populations are centered around different points and have different 
patterns of scatter, and where one considers multivariate normal distribu- 
tions to be reasonably good approximations for these two populations near 
their centers and between their two centers (though not far from the centers, 
where the densities are small). In such a case one may want to divide the
sample space into the two regions of classification by some simple curve or
surface. The simplest is a line or hyperplane; the procedure may then be
termed linear.

Let $b$ ($\ne 0$) be a vector (of $p$ components) and $c$ a scalar. An observation
$x$ is classified as from the first population if $b'x \ge c$ and as from the second if
$b'x < c$. We are primarily interested in situations where the important
difference between the two populations is the difference between the centers;
we assume $\mu^{(1)} \ne \mu^{(2)}$ as well as $\Sigma_1 \ne \Sigma_2$, and that $\Sigma_1$ and $\Sigma_2$ are
nonsingular.

When sampling from the $i$th population, $b'x$ has a univariate normal
distribution with mean $\mathcal{E}(b'x \mid \pi_i) = b'\mu^{(i)}$ and variance

(6)  $$\mathcal{E}\left[\left(b'x - b'\mu^{(i)}\right)^2 \mid \pi_i\right] = \mathcal{E}\left[b'\left(x - \mu^{(i)}\right)\left(x - \mu^{(i)}\right)'b \mid \pi_i\right] = b'\Sigma_i b.$$


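The probability of misclassifying an observation when it comes from the first
population is

(7)  $$\begin{aligned} P(2 \mid 1) &= \Pr\{b'x < c \mid \pi_1\} = \Pr\left\{\frac{b'x - b'\mu^{(1)}}{(b'\Sigma_1 b)^{1/2}} < \frac{c - b'\mu^{(1)}}{(b'\Sigma_1 b)^{1/2}} \,\Big|\, \pi_1\right\} \\ &= \Phi\left(\frac{c - b'\mu^{(1)}}{(b'\Sigma_1 b)^{1/2}}\right) = 1 - \Phi\left(\frac{b'\mu^{(1)} - c}{(b'\Sigma_1 b)^{1/2}}\right). \end{aligned}$$

The probability of misclassifying an observation when it comes from the
second population is

(8)  $$\begin{aligned} P(1 \mid 2) &= \Pr\{b'x \ge c \mid \pi_2\} = \Pr\left\{\frac{b'x - b'\mu^{(2)}}{(b'\Sigma_2 b)^{1/2}} \ge \frac{c - b'\mu^{(2)}}{(b'\Sigma_2 b)^{1/2}} \,\Big|\, \pi_2\right\} \\ &= 1 - \Phi\left(\frac{c - b'\mu^{(2)}}{(b'\Sigma_2 b)^{1/2}}\right). \end{aligned}$$

It is desired to make these probabilities small or, equivalently, to make the
arguments

(9)  $$y_1 = \frac{b'\mu^{(1)} - c}{(b'\Sigma_1 b)^{1/2}}, \qquad y_2 = \frac{c - b'\mu^{(2)}}{(b'\Sigma_2 b)^{1/2}}$$

large. We shall consider making $y_1$ large for given $y_2$.
When we eliminate $c$ from (9), we obtain

(10)  $$y_1 = \left[b'\gamma - y_2\,(b'\Sigma_2 b)^{1/2}\right](b'\Sigma_1 b)^{-1/2},$$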


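where $\gamma = \mu^{(1)} - \mu^{(2)}$. To maximize $y_1$ for given $y_2$ we differentiate $y_1$ with
respect to $b$ to obtain

(11)  $$\frac{\partial y_1}{\partial b} = \left[\gamma - y_2\,(b'\Sigma_2 b)^{-1/2}\,\Sigma_2 b\right](b'\Sigma_1 b)^{-1/2} - \left[b'\gamma - y_2\,(b'\Sigma_2 b)^{1/2}\right](b'\Sigma_1 b)^{-3/2}\,\Sigma_1 b.$$

If we let

(12)  $$t_1 = \frac{b'\gamma - y_2\,(b'\Sigma_2 b)^{1/2}}{b'\Sigma_1 b},$$

(13)  $$t_2 = \frac{y_2}{(b'\Sigma_2 b)^{1/2}},$$

then (11) set equal to 0 is

(14)  $$(t_1\Sigma_1 + t_2\Sigma_2)\,b = \gamma.$$

Note that (13) and (14) imply (12). If there is a pair $t_1, t_2$, and a vector $b$
satisfying (12) and (13), then $c$ is obtained from (9) as

(15)  $$c = y_2\,(b'\Sigma_2 b)^{1/2} + b'\mu^{(2)} = t_2\,b'\Sigma_2 b + b'\mu^{(2)}.$$

Then from (9), (12), and (13)

(16)  $$y_1 = \frac{b'\mu^{(1)} - \left(t_2\,b'\Sigma_2 b + b'\mu^{(2)}\right)}{(b'\Sigma_1 b)^{1/2}} = t_1\,(b'\Sigma_1 b)^{1/2}.$$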


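Now consider (14) as a function of $t$ ($0 \le t \le 1$). Let $t_1 = t$ and $t_2 = 1 - t$;
then $b = (t_1\Sigma_1 + t_2\Sigma_2)^{-1}\gamma$. Define $v_1 = t_1(b'\Sigma_1 b)^{1/2}$ and $v_2 = t_2(b'\Sigma_2 b)^{1/2}$. The
derivative of $v_1^2$ with respect to $t$ is

(17)  $$\begin{aligned} &\frac{d}{dt}\,t^2\,\gamma'[t\Sigma_1 + (1-t)\Sigma_2]^{-1}\,\Sigma_1\,[t\Sigma_1 + (1-t)\Sigma_2]^{-1}\gamma \\ &\quad = 2t\,\gamma'[t\Sigma_1 + (1-t)\Sigma_2]^{-1}\,\Sigma_1\,[t\Sigma_1 + (1-t)\Sigma_2]^{-1}\gamma \\ &\qquad - t^2\,\gamma'[t\Sigma_1 + (1-t)\Sigma_2]^{-1}(\Sigma_1 - \Sigma_2)[t\Sigma_1 + (1-t)\Sigma_2]^{-1}\,\Sigma_1\,[t\Sigma_1 + (1-t)\Sigma_2]^{-1}\gamma \\ &\qquad - t^2\,\gamma'[t\Sigma_1 + (1-t)\Sigma_2]^{-1}\,\Sigma_1\,[t\Sigma_1 + (1-t)\Sigma_2]^{-1}(\Sigma_1 - \Sigma_2)[t\Sigma_1 + (1-t)\Sigma_2]^{-1}\gamma \\ &\quad = t\,\gamma'[t\Sigma_1 + (1-t)\Sigma_2]^{-1}\left\{\Sigma_1[t\Sigma_1 + (1-t)\Sigma_2]^{-1}\Sigma_2 + \Sigma_2[t\Sigma_1 + (1-t)\Sigma_2]^{-1}\Sigma_1\right\}[t\Sigma_1 + (1-t)\Sigma_2]^{-1}\gamma, \end{aligned}$$

which is positive for $0 < t < 1$ by the following lemma.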


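Lemma 6.10.1. If $\Sigma_1$ and $\Sigma_2$ are positive definite and $t_1 > 0$, $t_2 > 0$, then

(18)  $$\Sigma_1[t_1\Sigma_1 + t_2\Sigma_2]^{-1}\Sigma_2$$

is positive definite.

Proof. The matrix (18) is

(19)  $$\Sigma_1[t_1\Sigma_1 + t_2\Sigma_2]^{-1}\Sigma_2 = \left\{\Sigma_2^{-1}[t_1\Sigma_1 + t_2\Sigma_2]\Sigma_1^{-1}\right\}^{-1} = \left[t_1\Sigma_2^{-1} + t_2\Sigma_1^{-1}\right]^{-1}. \qquad\blacksquare$$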


Similarly $dv_2^2/dt < 0$. Since $v_1 \ge 0$, $v_2 \ge 0$, we see that $v_1$ increases with $t$
from 0 at $t = 0$ to $(\gamma'\Sigma_1^{-1}\gamma)^{1/2}$ at $t = 1$ and $v_2$ decreases from $(\gamma'\Sigma_2^{-1}\gamma)^{1/2}$ at
$t = 0$ to 0 at $t = 1$. The coordinates $v_1$ and $v_2$ are continuous functions of $t$.
For given $y_2$, $0 < y_2 < (\gamma'\Sigma_2^{-1}\gamma)^{1/2}$, there is a $t$ such that $y_2 = v_2 = t_2(b'\Sigma_2 b)^{1/2}$
and $b$ satisfies (14) for $t_1 = t$ and $t_2 = 1 - t$. Then $y_1 = v_1 = t_1(b'\Sigma_1 b)^{1/2}$
maximizes $y_1$ for that value of $y_2$. Similarly, given $y_1$, $0 < y_1 < (\gamma'\Sigma_1^{-1}\gamma)^{1/2}$, there is
a $t$ such that $y_1 = v_1 = t_1(b'\Sigma_1 b)^{1/2}$ and $b$ satisfies (14) for $t_1 = t$ and $t_2 = 1 - t$,
and $y_2 = v_2 = t_2(b'\Sigma_2 b)^{1/2}$ maximizes $y_2$. Note that $y_1 \ge 0$, $y_2 \ge 0$ implies the
errors of misclassification are not greater than $\tfrac12$.

We now argue that the set of $y_1, y_2$ defined this way corresponds to
admissible linear procedures. Let $x_1, x_2$ be in this set, and suppose another
procedure defined by $z_1, z_2$ were better than $x_1, x_2$, that is, $x_1 \le z_1$, $x_2 \le z_2$
with at least one strict inequality. For $y_1 = z_1$ let $y_2^*$ be the maximum $y_2$
among linear procedures; then $z_1 = y_1$, $z_2 \le y_2^*$, and hence $x_1 \le y_1$, $x_2 \le y_2^*$.
However, this is possible only if $x_1 = y_1$, $x_2 = y_2^*$, because $dy_1/dy_2 < 0$. Now
we have a contradiction to the assumption that $z_1, z_2$ was better than $x_1, x_2$.
Thus $x_1, x_2$ corresponds to an admissible linear procedure.


Use of Admissible Linear Procedures 

Given $t_1$ and $t_2$ such that $t_1\Sigma_1 + t_2\Sigma_2$ is positive definite, one would
compute the optimum $b$ by solving the linear equations (14) and then
compute $c$ by one of (9). Usually $t_1$ and $t_2$ are not given, but a desired
solution is specified in another way. We consider three ways.
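For given $t_1 = t$ and $t_2 = 1 - t$ the computation just described reduces to one linear solve. The following is a minimal sketch under the assumption $0 < t < 1$ with both covariance matrices positive definite; the function name `linear_procedure` and the numerical parameter values are hypothetical.

```python
import numpy as np

def linear_procedure(t, mu1, mu2, sigma1, sigma2):
    """Admissible linear rule 'classify into population 1 if b'x >= c' for t_1 = t, t_2 = 1 - t."""
    gamma = mu1 - mu2
    b = np.linalg.solve(t * sigma1 + (1.0 - t) * sigma2, gamma)   # (14)
    c = (1.0 - t) * (b @ sigma2 @ b) + b @ mu2                    # (15)
    y1 = (b @ mu1 - c) / np.sqrt(b @ sigma1 @ b)                  # (9)
    y2 = (c - b @ mu2) / np.sqrt(b @ sigma2 @ b)                  # (9)
    return b, c, y1, y2

# Hypothetical parameter values used only to exercise the function.
mu1, mu2 = np.array([2.0, 1.0]), np.array([0.0, 0.0])
sigma1 = np.array([[1.0, 0.3], [0.3, 2.0]])
sigma2 = np.array([[2.0, -0.5], [-0.5, 1.0]])
b, c, y1, y2 = linear_procedure(0.4, mu1, mu2, sigma1, sigma2)
```

By (13) and (16) the returned values should satisfy $y_1 = t_1(b'\Sigma_1 b)^{1/2}$ and $y_2 = t_2(b'\Sigma_2 b)^{1/2}$, which gives a direct numerical check; the two error probabilities are then $1 - \Phi(y_1)$ and $1 - \Phi(y_2)$.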


Minimization of One Probability of Misclassification for a Specified
Probability of the Other

Suppose we are given $y_2$ (or, equivalently, the probability of misclassification
when sampling from the second distribution) and we want to maximize $y_1$
(or, equivalently, minimize the probability of misclassification when sampling
from the first distribution). Suppose $y_2 > 0$ (i.e., the given probability of
misclassification is less than $\tfrac12$). Then if the maximum $y_1 \ge 0$, we want to find
$t_2 = 1 - t_1$ such that $y_2 = t_2(b'\Sigma_2 b)^{1/2}$, where $b = [t_1\Sigma_1 + t_2\Sigma_2]^{-1}\gamma$. The
solution can be approximated by trial and error, since $y_2$ is an increasing function
of $t_2$. For $t_2 = 0$, $y_2 = 0$; and for $t_2 = 1$, $y_2 = (b'\Sigma_2 b)^{1/2} = (b'\gamma)^{1/2} = (\gamma'\Sigma_2^{-1}\gamma)^{1/2}$,
where $\Sigma_2 b = \gamma$. One could try other values of $t_2$ successively by solving (14)
and inserting in $b'\Sigma_2 b$ until $t_2(b'\Sigma_2 b)^{1/2}$ agreed closely enough with the
desired $y_2$. [$y_1 > 0$ if the specified $y_2 < (\gamma'\Sigma_2^{-1}\gamma)^{1/2}$.]
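The trial-and-error search can be organized as a bisection on $t_2$, since $y_2$ is increasing in $t_2$. This is a sketch of one possible way to do it, not the procedure of the text; the names `y_pair` and `solve_for_y2`, the tolerance, and the numerical inputs are assumptions made for illustration.

```python
import numpy as np

def y_pair(t1, gamma, sigma1, sigma2):
    """Return (y_1, y_2) for t_1 = t1, t_2 = 1 - t1, with b taken from (14)."""
    t2 = 1.0 - t1
    b = np.linalg.solve(t1 * sigma1 + t2 * sigma2, gamma)
    return t1 * np.sqrt(b @ sigma1 @ b), t2 * np.sqrt(b @ sigma2 @ b)

def solve_for_y2(y2_target, gamma, sigma1, sigma2, tol=1e-10):
    """Bisection on t_2 in [0, 1]; requires 0 < y2_target < (gamma' Sigma_2^{-1} gamma)^{1/2}."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        t2 = 0.5 * (lo + hi)
        _, y2 = y_pair(1.0 - t2, gamma, sigma1, sigma2)
        if y2 < y2_target:
            lo = t2      # y_2 too small: increase t_2
        else:
            hi = t2
    t2 = 0.5 * (lo + hi)
    y1, y2 = y_pair(1.0 - t2, gamma, sigma1, sigma2)
    return t2, y1, y2

# Hypothetical inputs; the target must lie below sqrt(gamma' Sigma_2^{-1} gamma), about 2.14 here.
gamma = np.array([2.0, 1.0])
sigma1 = np.array([[1.0, 0.3], [0.3, 2.0]])
sigma2 = np.array([[2.0, -0.5], [-0.5, 1.0]])
t2_found, y1_found, y2_found = solve_for_y2(1.0, gamma, sigma1, sigma2)
```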


The Minimax Procedure 

The minimax procedure is the admissible procedure for which $y_1 = y_2$. Since
for this procedure both probabilities of correct classification are greater than
$\tfrac12$, we have $y_1 = y_2 > 0$ and $t_1 > 0$, $t_2 > 0$. We want to find $t$ ($= t_1 = 1 - t_2$) so
that

(20)  $$0 = y_1^2 - y_2^2 = t^2\,b'\Sigma_1 b - (1 - t)^2\,b'\Sigma_2 b = b'\left[t^2\Sigma_1 - (1 - t)^2\Sigma_2\right]b.$$


Since $y_1^2$ increases with $t$ and $y_2^2$ decreases with increasing $t$, there is one and
only one solution to (20), and this can be approximated by trial and error by
guessing a value of $t$ ($0 < t < 1$), solving (14) for $b$, and computing the
quadratic form on the right of (20). Then another $t$ can be tried.
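Because the right-hand side of (20) changes sign exactly once on $0 < t < 1$, the trial-and-error search can be replaced by bisection. The sketch below is one such implementation under that assumption; the function name `minimax_linear` and the numerical inputs are hypothetical.

```python
import numpy as np

def minimax_linear(mu1, mu2, sigma1, sigma2, tol=1e-12):
    """Find t with y_1 = y_2 by bisection on the quadratic form in (20)."""
    gamma = mu1 - mu2
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        t = 0.5 * (lo + hi)
        b = np.linalg.solve(t * sigma1 + (1.0 - t) * sigma2, gamma)      # (14)
        form = t**2 * (b @ sigma1 @ b) - (1.0 - t)**2 * (b @ sigma2 @ b)  # (20)
        if form < 0:
            lo = t       # y_1 < y_2: increase t
        else:
            hi = t
    t = 0.5 * (lo + hi)
    b = np.linalg.solve(t * sigma1 + (1.0 - t) * sigma2, gamma)
    c = (1.0 - t) * (b @ sigma2 @ b) + b @ mu2                            # (15)
    y1 = t * np.sqrt(b @ sigma1 @ b)
    y2 = (1.0 - t) * np.sqrt(b @ sigma2 @ b)
    return t, b, c, y1, y2

# Hypothetical parameter values.
mu1, mu2 = np.array([2.0, 1.0]), np.array([0.0, 0.0])
sigma1 = np.array([[1.0, 0.3], [0.3, 2.0]])
sigma2 = np.array([[2.0, -0.5], [-0.5, 1.0]])
t_mm, b_mm, c_mm, y1_mm, y2_mm = minimax_linear(mu1, mu2, sigma1, sigma2)
```

At the solution the common value of $y_1$ and $y_2$ should equal $b'\gamma / [(b'\Sigma_1 b)^{1/2} + (b'\Sigma_2 b)^{1/2}]$, which follows from multiplying (14) on the left by $b'$.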

An alternative approach is to set $y_1 = y_2$ in (9) and solve for $c$. Then the
common value of $y_1 = y_2$ is

(21)  $$\frac{b'\gamma}{(b'\Sigma_1 b)^{1/2} + (b'\Sigma_2 b)^{1/2}},$$

and we want to find $b$ to maximize this, where $b$ is of the form

(22)  $$[t\Sigma_1 + (1 - t)\Sigma_2]^{-1}\gamma$$

with $0 < t < 1$.

When $\Sigma_1 = \Sigma_2 = \Sigma$, twice the maximum of (21) is the Mahalanobis
distance $\Delta = (\gamma'\Sigma^{-1}\gamma)^{1/2}$ between the populations. This suggests that when $\Sigma_1$ may be
unequal to $\Sigma_2$, twice the maximum of (21) might be called the distance
between the populations.

Welch and Wimpress (1961) have programmed the minimax procedure 
and applied it to the recognition of spoken sounds. 


Case of A Priori Probabilities 
Suppose we are given a priori probabilities, $q_1$ and $q_2$, of the first and second
populations, respectively. Then the probability of a misclassification is

(23)  $$q_1[1 - \Phi(y_1)] + q_2[1 - \Phi(y_2)] = 1 - [q_1\Phi(y_1) + q_2\Phi(y_2)],$$

which we want to minimize. The solution will be an admissible linear
procedure. If we know it involves $y_1 \ge 0$ and $y_2 \ge 0$, we can substitute
$y_1 = t(b'\Sigma_1 b)^{1/2}$ and $y_2 = (1 - t)(b'\Sigma_2 b)^{1/2}$, where $b = [t\Sigma_1 + (1 - t)\Sigma_2]^{-1}\gamma$,
into (23) and set the derivative of (23) with respect to $t$ equal to 0, obtaining

(24)  $$q_1\,\phi(y_1)\frac{dy_1}{dt} + q_2\,\phi(y_2)\frac{dy_2}{dt} = 0,$$

where $\phi(u) = (2\pi)^{-1/2}e^{-u^2/2}$. There does not seem to be any easy or direct
way of solving (24) for $t$. The left-hand side of (24) is not necessarily
monotonic. In fact, there may be several roots to (24). If there are, the
absolute minimum will be found by putting the solution into (23). (We
remind the reader that the curve of admissible error probabilities is not
necessarily convex.)
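Since (24) may have several roots, one simple numerical alternative is to evaluate (23) on a grid of $t$ values and keep the smallest. The sketch below does this under the assumption that the solution has $0 < t < 1$ (so $y_1 > 0$ and $y_2 > 0$); it is a grid search substituted here for solving (24), and the names `Phi` and `best_t` and the grid size are choices made for illustration.

```python
import math
import numpy as np

def Phi(u):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def best_t(q1, q2, gamma, sigma1, sigma2, n_grid=2001):
    """Grid search over 0 < t < 1 for the minimum of the error probability (23)."""
    best = None
    for t in np.linspace(0.0, 1.0, n_grid)[1:-1]:
        b = np.linalg.solve(t * sigma1 + (1.0 - t) * sigma2, gamma)
        y1 = t * math.sqrt(b @ sigma1 @ b)
        y2 = (1.0 - t) * math.sqrt(b @ sigma2 @ b)
        err = q1 * (1.0 - Phi(y1)) + q2 * (1.0 - Phi(y2))   # (23)
        if best is None or err < best[0]:
            best = (err, t, b)
    return best

# Hypothetical inputs.
gamma = np.array([2.0, 1.0])
sigma1 = np.array([[1.0, 0.3], [0.3, 2.0]])
sigma2 = np.array([[2.0, -0.5], [-0.5, 1.0]])
err_min, t_min, b_min = best_t(0.7, 0.3, gamma, sigma1, sigma2)
```

A grid value found this way could then be refined by a local root search on (24) if more precision were wanted.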

Anderson and Bahadur (1962) studied these linear procedures in general,
including $y_1 < 0$ and $y_2 < 0$. Clunies-Ross and Riffenburgh (1960)
approached the problem from a more geometric point of view.


PROBLEMS 


6.1. (Sec. 6.3) Let $\pi_i$ be $N(\mu, \Sigma_i)$, $i = 1, 2$. Find the form of the admissible
classification procedures.


6.2. (Sec. 6.3) Prove that every complete class of procedures includes the class of 
admissible procedures. 


6.3. (Sec. 6.3) Prove that if the class of admissible procedures is complete, it is
minimal complete.


6.4. (Sec. 6.3) The Neyman-Pearson fundamental lemma states that of all tests at a
given significance level of the null hypothesis that $x$ is drawn from $p_1(x)$
against the alternative that it is drawn from $p_2(x)$ the most powerful test has the
critical region $p_1(x)/p_2(x) < k$. Show that the discussion in Section 6.3 proves
this result.


6.5. (Sec. 6.3) When $p(x) = n(x \mid \mu, \Sigma)$ find the best test of $\mu = 0$ against $\mu = \mu^*$
at significance level $\varepsilon$. Show that this test is uniformly most powerful against all
alternatives $\mu = c\mu^*$, $c > 0$. Prove that there is no uniformly most powerful test
against $\mu = \mu^{(1)}$ and $\mu = \mu^{(2)}$ unless $\mu^{(1)} = c\mu^{(2)}$ for some $c > 0$.


6.6. (Sec. 6.4) Let $P(2 \mid 1)$ and $P(1 \mid 2)$ be defined by (14) and (15). Prove if
$-\tfrac12\Delta^2 < c < \tfrac12\Delta^2$, then $P(2 \mid 1)$ and $P(1 \mid 2)$ are decreasing functions of $\Delta$.


6.7. (Sec. 6.4) Let $x' = (x^{(1)\prime}, x^{(2)\prime})$. Using Problem 5.23 and Problem 6.6, prove
that the class of classification procedures based on $x$ is uniformly as good as
the class of procedures based on $x^{(1)}$.


6.8. (Sec. 6.5.1) Find the criterion for classifying irises as Iris setosa or Iris
versicolor on the basis of data given in Section 5.3.4. Classify a random sample
of 5 Iris virginica in Table 3.4.


6.9. (Sec. 6.5.1) Let $W(x)$ be the classification criterion given by (2). Show that the
$T^2$-criterion for testing $N(\mu^{(1)}, \Sigma) = N(\mu^{(2)}, \Sigma)$ is proportional to $W(\bar{x}^{(1)})$ and
$W(\bar{x}^{(2)})$.


6.10. (Sec. 6.5.1) Show that the probabilities of misclassification of $x_1, \ldots, x_N$ (all
assumed to be from either $\pi_1$ or $\pi_2$) decrease as $N$ increases.


6.11. (Sec. 6.5) Show that the elements of M are invariant under the transforma- 
tion (34) and that any function of the sufficient statistics that is invariant is a 
function of M. 

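6.12. (Sec. 6.5) Consider $d'x_\alpha^{(i)}$. Prove that the ratio

$$\frac{\left(d'\bar{x}^{(1)} - d'\bar{x}^{(2)}\right)^2}{\sum_{\alpha=1}^{N_1}\left(d'x_\alpha^{(1)} - d'\bar{x}^{(1)}\right)^2 + \sum_{\alpha=1}^{N_2}\left(d'x_\alpha^{(2)} - d'\bar{x}^{(2)}\right)^2}$$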


6.13. (Sec. 6.6) Show that the derivative of (2) to terms of order $n^{-1}$ is


1 1 —~1 mid 
~oC3ay(a + [Ea # Pa - Ge"]} 


6.14. (Sec. 6.6) Show $\mathcal{E}D^2$ is (4). [Hint: Let $\Sigma = I$ and show that $\mathcal{E}(S^{-1} \mid \Sigma = I) =$
$[n/(n - p - 1)]I$.]


6.15. (Sec. 6.6.2) Show 


Z~ ip? Z — 5A° 
Pf su a} —Pr{ A Su 


D 
= 6(0{ one [us +(p—-3)u—- A°u + pal 
1 2 
+ oye +2Au?+(p-3+)u-A) +p] 


1 2 
+4, [30 + 4Au? + (2p—3+ A*)u + 2(p- al) + O(n7*). 


6.16. (Sec. 6.8) Let $\pi_i$ be $N(\mu^{(i)}, \Sigma)$, $i = 1, \ldots, m$. If the $\mu^{(i)}$ are on a line (i.e.,
$\mu^{(i)} = \mu + \nu_i\beta$), show that for admissible procedures the $R_i$ are defined by
parallel planes. Thus show that only one discriminant function $u_{ij}(x)$ need be
used.


6.17. (Sec. 6.8) In Section 8.8 data are given on samples from four populations of
skulls. Consider the first two measurements and the first three samples.
Construct the classification functions $u_{ij}(x)$. Find the procedure for $q_i =$
$N_i/(N_1 + N_2 + N_3)$. Find the minimax procedure.


6.18. (Sec. 6.10) Show that $b'x = c$ is the equation of a plane that is tangent to an
ellipsoid of constant density of $\pi_1$ and to an ellipsoid of constant density of $\pi_2$
at a common point.


6.19. (Sec. 6.8) Let $x_1^{(i)}, \ldots, x_{N_i}^{(i)}$ be observations from $N(\mu^{(i)}, \Sigma)$, $i = 1, 2, 3$, and let
$x$ be an observation to be classified. Give explicitly the maximum likelihood
rule.


6.20. (Sec. 6.5) Verify (33). 


CHAPTER 7 


The Distribution of the Sample 
Covariance Matrix and the 
Sample Generalized Variance 


7.1. INTRODUCTION 


The sample covariance matrix, $S = [1/(N - 1)]\sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(x_\alpha - \bar{x})'$, is an
unbiased estimator of the population covariance matrix $\Sigma$. In Section 4.2 we
found the density of $A = (N - 1)S$ in the case of a $2 \times 2$ matrix. In Section
7.2 this result will be generalized to the case of a matrix $A$ of any order.
When $\Sigma = I$, this distribution is in a sense a generalization of the $\chi^2$-distribution.
The distribution of $A$ (or $S$), often called the Wishart distribution, is
fundamental to multivariate statistical analysis. In Sections 7.3 and 7.4 we
discuss some properties of the Wishart distribution.

The generalized variance of the sample is defined as $|S|$ in Section 7.5; it
is a measure of the scatter of the sample. Its distribution is characterized.
The density of the set of all correlation coefficients when the components of
the observed vector are independent is obtained in Section 7.6.

The inverted Wishart distribution is introduced in Section 7.7 and is used
as an a priori distribution of $\Sigma$ to obtain a Bayes estimator of the covariance
matrix. In Section 7.8 we consider improving on $S$ as an estimator of $\Sigma$ with
respect to two loss functions. Section 7.9 treats the distributions for sampling
from elliptically contoured distributions.


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc.


7.2. THE WISHART DISTRIBUTION 


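We shall obtain the distribution of $A = \sum_{\alpha=1}^{N}(X_\alpha - \bar{X})(X_\alpha - \bar{X})'$, where
$X_1, \ldots, X_N$ ($N > p$) are independent, each with the distribution $N(\mu, \Sigma)$. As
was shown in Section 3.3, $A$ is distributed as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where $n = N - 1$
and $Z_1, \ldots, Z_n$ are independent, each with the distribution $N(0, \Sigma)$. We shall
show that the density of $A$ for $A$ positive definite is

(1)  $$\frac{|A|^{\frac12(n-p-1)}\exp\left(-\tfrac12\operatorname{tr}\Sigma^{-1}A\right)}{2^{\frac12 pn}\,\pi^{p(p-1)/4}\,|\Sigma|^{\frac12 n}\prod_{i=1}^{p}\Gamma\left[\tfrac12(n + 1 - i)\right]}.$$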


We shall first consider the case of $\Sigma = I$. Let

(2)  $$(Z_1, \ldots, Z_n) = \begin{pmatrix} v_1' \\ \vdots \\ v_p' \end{pmatrix}.$$

Then the elements of $A = (a_{ij})$ are inner products of these $n$-component
vectors, $a_{ij} = v_i'v_j$. The vectors $v_1, \ldots, v_p$ are independently distributed, each
according to $N(0, I_n)$. It will be convenient to transform to new coordinates
according to the Gram-Schmidt orthogonalization. Let $w_1 = v_1$,

(3)  $$w_i = v_i - \sum_{j=1}^{i-1}\frac{v_i'w_j}{w_j'w_j}\,w_j, \qquad i = 2, \ldots, p.$$

We prove by induction that $w_k$ is orthogonal to $w_i$, $k < i$. Assume $w_k'w_h = 0$,
$k \ne h$, $k, h = 1, \ldots, i - 1$; then take the inner product of $w_k$ and (3) to obtain
$w_k'w_i = 0$, $k = 1, \ldots, i - 1$. (Note that $\Pr\{\|w_i\| = 0\} = 0$.)

Define $t_{ii} = \|w_i\| = \sqrt{w_i'w_i}$, $i = 1, \ldots, p$, and $t_{ij} = v_i'w_j/\|w_j\|$, $j = 1, \ldots, i - 1$,
$i = 2, \ldots, p$. Since $v_i = \sum_{j=1}^{i}(t_{ij}/\|w_j\|)\,w_j$,

(4)  $$a_{hi} = v_h'v_i = \sum_{j=1}^{\min(h,i)} t_{hj}\,t_{ij}.$$

If we define the lower triangular matrix $T = (t_{ij})$ with $t_{ii} > 0$, $i = 1, \ldots, p$, and
$t_{ij} = 0$, $i < j$, then

(5)  $$A = TT'.$$
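The construction (3)-(5) can be checked numerically: the matrix $T$ built from the Gram-Schmidt vectors should coincide with the lower-triangular Cholesky factor of $A$, since that factor with positive diagonal is unique. The following sketch assumes the rows of the array `V` are the vectors $v_1', \ldots, v_p'$; the function name `gram_schmidt_T` is introduced here for illustration.

```python
import numpy as np

def gram_schmidt_T(V):
    """V has rows v_1', ..., v_p' (each of length n >= p). Returns the lower-triangular T of (5)."""
    p = V.shape[0]
    W = np.zeros(V.shape, dtype=float)
    T = np.zeros((p, p))
    for i in range(p):
        w = V[i].astype(float)                      # start from v_i (astype returns a copy)
        for j in range(i):
            T[i, j] = V[i] @ W[j] / np.linalg.norm(W[j])        # t_ij = v_i' w_j / ||w_j||
            w = w - (V[i] @ W[j]) / (W[j] @ W[j]) * W[j]        # subtract projection, as in (3)
        W[i] = w
        T[i, i] = np.linalg.norm(w)                 # t_ii = ||w_i||
    return T

rng = np.random.default_rng(1)
V = rng.standard_normal((3, 6))      # p = 3 vectors with n = 6 components
A = V @ V.T                          # a_ij = v_i' v_j
T = gram_schmidt_T(V)
```

Comparing `T` with `np.linalg.cholesky(A)` is a convenient test; in practice the Cholesky routine would be the more stable way to obtain $T$.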


Note that $t_{ij}$, $j = 1, \ldots, i - 1$, are the first $i - 1$ coordinates of $v_i$ in the
coordinate system with $w_1, \ldots, w_{i-1}$ as the first $i - 1$ coordinate axes. (See
Figure 7.1.) The sum of the other $n - i + 1$ coordinates squared is
$\|v_i\|^2 - \sum_{j=1}^{i-1} t_{ij}^2 = t_{ii}^2 = \|w_i\|^2$; $w_i$ is the vector from $v_i$ to its projection on $w_1, \ldots, w_{i-1}$
(or equivalently on $v_1, \ldots, v_{i-1}$).


Figure 7.1. Transformation of coordinates.


Lemma 7.2.1. Conditional on $w_1, \ldots, w_{i-1}$ (or equivalently on $v_1, \ldots, v_{i-1}$),
$t_{i1}, \ldots, t_{i,i-1}$ and $t_{ii}^2$ are independently distributed; $t_{ij}$ is distributed according to
$N(0, 1)$, $i > j$; and $t_{ii}^2$ has the $\chi^2$-distribution with $n - i + 1$ degrees of freedom.


Proof. The coordinates of $v_i$ referred to the new orthogonal coordinates
with $w_1, \ldots, w_{i-1}$ defining the first coordinate axes are independently normally
distributed with means 0 and variances 1 (Theorem 3.3.1). $t_{ii}^2$ is the sum
of the coordinates squared omitting the first $i - 1$. ∎


Since the conditional distribution of $t_{i1}, \ldots, t_{ii}$ does not depend on
$v_1, \ldots, v_{i-1}$, they are distributed independently of $t_{11}, t_{21}, t_{22}, \ldots, t_{i-1,i-1}$.


Corollary 7.2.1. Let $Z_1, \ldots, Z_n$ ($n \ge p$) be independently distributed, each
according to $N(0, I)$; let $A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha' = TT'$, where $t_{ij} = 0$, $i < j$, and $t_{ii} > 0$,
$i = 1, \ldots, p$. Then $t_{11}, t_{21}, \ldots, t_{pp}$ are independently distributed; $t_{ij}$ is distributed
according to $N(0, 1)$, $i > j$; and $t_{ii}^2$ has the $\chi^2$-distribution with $n - i + 1$ degrees
of freedom.
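Corollary 7.2.1 gives a direct way to simulate $A$ without generating the $Z_\alpha$: draw the elements of $T$ independently and form $TT'$. The sketch below does this and, for a general covariance matrix, premultiplies by a Cholesky factor $C$ with $CC' = \Sigma$; that extension and the names `sample_T` and `sample_wishart` are assumptions introduced here for illustration.

```python
import numpy as np

def sample_T(n, p, rng):
    """Draw the lower-triangular T of Corollary 7.2.1 (case Sigma = I)."""
    T = np.zeros((p, p))
    for i in range(p):                                # i is 0-based; the text's index is i + 1
        T[i, :i] = rng.standard_normal(i)             # t_ij ~ N(0, 1) for i > j
        T[i, i] = np.sqrt(rng.chisquare(n - i))       # t_ii^2 ~ chi^2 with n - (i + 1) + 1 d.f.
    return T

def sample_wishart(n, sigma, rng):
    """Draw A = (C T)(C T)' where C C' = sigma and T is as above."""
    C = np.linalg.cholesky(sigma)
    T_star = C @ sample_T(n, sigma.shape[0], rng)
    return T_star @ T_star.T

rng = np.random.default_rng(0)
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
draws = np.array([sample_wishart(10, sigma, rng) for _ in range(20000)])
mean_A = draws.mean(axis=0)     # should be close to n * sigma
```

Since $\mathcal{E}A = n\Sigma$, the average of many draws gives a simple sanity check on the sampler.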


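Since $t_{ii}$ has density $2^{-\frac12(n-i-1)}\,t^{n-i}\,e^{-\frac12 t^2}/\Gamma[\tfrac12(n + 1 - i)]$, the joint density
of $t_{ij}$, $j = 1, \ldots, i$, $i = 1, \ldots, p$, is

(6)  $$\prod_{i=1}^{p}\frac{t_{ii}^{\,n-i}\exp\left(-\tfrac12\sum_{j=1}^{i}t_{ij}^2\right)}{\pi^{\frac12(i-1)}\,2^{\frac12 n - 1}\,\Gamma\left[\tfrac12(n + 1 - i)\right]} = \frac{\prod_{i=1}^{p}t_{ii}^{\,n-i}\exp\left(-\tfrac12\sum_{i=1}^{p}\sum_{j=1}^{i}t_{ij}^2\right)}{2^{\frac12 p(n-2)}\,\pi^{p(p-1)/4}\prod_{i=1}^{p}\Gamma\left[\tfrac12(n + 1 - i)\right]}.$$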


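Let $C$ be a lower triangular matrix ($c_{ij} = 0$, $i < j$) such that $\Sigma = CC'$ and
$c_{ii} > 0$. The linear transformation $T^* = CT$, that is,

(7)  $$t_{ij}^* = \begin{cases}\displaystyle\sum_{k=j}^{i}c_{ik}\,t_{kj}, & i \ge j, \\ 0, & i < j,\end{cases}$$

can be written

(8)  $$\begin{pmatrix} t_{11}^* \\ t_{21}^* \\ t_{22}^* \\ \vdots \\ t_{p1}^* \\ \vdots \\ t_{pp}^* \end{pmatrix} = \begin{pmatrix} c_{11} & 0 & 0 & \cdots & 0 & \cdots & 0 \\ \times & c_{22} & 0 & \cdots & 0 & \cdots & 0 \\ \times & \times & c_{22} & \cdots & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ \times & \times & \times & \cdots & c_{pp} & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ \times & \times & \times & \cdots & \times & \cdots & c_{pp} \end{pmatrix}\begin{pmatrix} t_{11} \\ t_{21} \\ t_{22} \\ \vdots \\ t_{p1} \\ \vdots \\ t_{pp} \end{pmatrix},$$

where $\times$ denotes an element, possibly nonzero. Since the matrix of the
transformation is triangular, its determinant is the product of the diagonal
elements, namely, $\prod_{i=1}^{p}c_{ii}^{\,i}$. The Jacobian of the transformation from $T$ to $T^*$
is the reciprocal of the determinant. The density of $T^*$ is obtained by
substituting into (6) $t_{ii} = t_{ii}^*/c_{ii}$ and

(9)  $$\begin{aligned}\sum_{i=1}^{p}\sum_{j=1}^{i}t_{ij}^2 &= \operatorname{tr}TT' \\ &= \operatorname{tr}C^{-1}T^*T^{*\prime}(C^{-1})' \\ &= \operatorname{tr}T^*T^{*\prime}(C')^{-1}C^{-1} \\ &= \operatorname{tr}T^*T^{*\prime}\Sigma^{-1} = \operatorname{tr}T^{*\prime}\Sigma^{-1}T^*,\end{aligned}$$

and using $\prod_{i=1}^{p}c_{ii}^2 = |C|\,|C'| = |\Sigma|$.
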
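Theorem 7.2.1. Let $Z_1, \ldots, Z_n$ ($n \ge p$) be independently distributed, each
according to $N(0, \Sigma)$; let $A = \sum_{\alpha=1}^{n}Z_\alpha Z_\alpha'$; and let $A = T^*T^{*\prime}$, where $t_{ij}^* = 0$,
$i < j$, and $t_{ii}^* > 0$. Then the density of $T^*$ is

(10)  $$\frac{\prod_{i=1}^{p}t_{ii}^{*\,n-i}\exp\left(-\tfrac12\operatorname{tr}\Sigma^{-1}T^*T^{*\prime}\right)}{2^{\frac12 p(n-2)}\,\pi^{p(p-1)/4}\,|\Sigma|^{\frac12 n}\prod_{i=1}^{p}\Gamma\left[\tfrac12(n + 1 - i)\right]}.$$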


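We can write (4) as $a_{hi} = \sum_{j=1}^{i}t_{hj}^*t_{ij}^*$ for $h \ge i$. Then

(11)  $$\frac{\partial a_{hi}}{\partial t_{kl}^*} = 0, \quad k > h; \qquad \frac{\partial a_{hi}}{\partial t_{kl}^*} = 0, \quad k = h,\ l > i;$$

that is, $\partial a_{hi}/\partial t_{kl}^* = 0$ if $k, l$ is beyond $h, i$ in the lexicographic ordering. The
Jacobian of the transformation from $A$ to $T^*$ is the determinant of the lower
triangular matrix with diagonal elements

(12)  $$\frac{\partial a_{hh}}{\partial t_{hh}^*} = 2t_{hh}^*,$$

(13)  $$\frac{\partial a_{hi}}{\partial t_{hi}^*} = t_{ii}^*, \qquad h > i.$$

The Jacobian is therefore $2^p\prod_{i=1}^{p}(t_{ii}^*)^{p+1-i}$. The Jacobian of the transformation
from $T^*$ to $A$ is the reciprocal.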


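Theorem 7.2.2. Let $Z_1, \ldots, Z_n$ be independently distributed, each according
to $N(0, \Sigma)$. The density of $A = \sum_{\alpha=1}^{n}Z_\alpha Z_\alpha'$ is

(14)  $$\frac{|A|^{\frac12(n-p-1)}\,e^{-\frac12\operatorname{tr}\Sigma^{-1}A}}{2^{\frac12 pn}\,\pi^{p(p-1)/4}\,|\Sigma|^{\frac12 n}\prod_{i=1}^{p}\Gamma\left[\tfrac12(n + 1 - i)\right]}$$

for $A$ positive definite, and 0 otherwise.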
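The density (14) is easiest to evaluate on the logarithmic scale. The following sketch assumes $n \ge p$ and uses a Cholesky attempt to decide whether $A$ is positive definite; the function name `wishart_logpdf` is introduced here for illustration.

```python
import math
import numpy as np

def wishart_logpdf(A, sigma, n):
    """Logarithm of the density (14) at A; returns -inf if A is not positive definite."""
    A = np.asarray(A, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    p = A.shape[0]
    try:
        np.linalg.cholesky(A)            # raises LinAlgError unless A is positive definite
    except np.linalg.LinAlgError:
        return -math.inf
    _, logdet_A = np.linalg.slogdet(A)
    _, logdet_sigma = np.linalg.slogdet(sigma)
    log_norm = (0.5 * p * n * math.log(2.0)
                + 0.25 * p * (p - 1) * math.log(math.pi)
                + 0.5 * n * logdet_sigma
                + sum(math.lgamma(0.5 * (n + 1 - i)) for i in range(1, p + 1)))
    return (0.5 * (n - p - 1) * logdet_A
            - 0.5 * np.trace(np.linalg.solve(sigma, A))
            - log_norm)

# For p = 1 the density reduces to that of sigma^2 times a chi-square variable with n d.f.
log_density_p1 = wishart_logpdf([[3.0]], [[2.0]], 5)
```

For $p = 1$ and $\Sigma = \sigma^2$, (14) becomes $a^{\frac12(n-2)}e^{-a/(2\sigma^2)}/[2^{\frac12 n}\sigma^{n}\Gamma(\tfrac12 n)]$, which provides a simple check.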


Corollary 7.2.2. Let $X_1, \ldots, X_N$ ($N > p$) be independently distributed, each
according to $N(\mu, \Sigma)$. Then the density of $A = \sum_{\alpha=1}^{N}(X_\alpha - \bar{X})(X_\alpha - \bar{X})'$ is (14)
for $n = N - 1$.


The density (14) will be denoted by $w(A \mid \Sigma, n)$, and the associated distribution
will be termed $W(\Sigma, n)$. If $n < p$, then $A$ does not have a density, but
its distribution is nevertheless defined, and we shall refer to it as $W(\Sigma, n)$.


Corollary 7.2.3. Let $X_1, \ldots, X_N$ ($N > p$) be independently distributed, each
according to $N(\mu, \Sigma)$. The distribution of $S = (1/n)\sum_{\alpha=1}^{N}(X_\alpha - \bar{X})(X_\alpha - \bar{X})'$ is
$W[(1/n)\Sigma, n]$, where $n = N - 1$.


Proof. $S$ has the distribution of $\sum_{\alpha=1}^{n}[(1/\sqrt{n})Z_\alpha][(1/\sqrt{n})Z_\alpha]'$, where
$(1/\sqrt{n})Z_1, \ldots, (1/\sqrt{n})Z_n$ are independently distributed, each according to
$N[0, (1/n)\Sigma]$. Theorem 7.2.2 implies this corollary. ∎




The Wishart distribution for $p = 2$ as given in Section 4.2.1 was derived by
Fisher (1915). The distribution for arbitrary $p$ was obtained by Wishart
(1928) by a geometric argument using $v_1, \ldots, v_p$ defined above. As noted in
Section 3.2, the $i$th diagonal element of $A$ is the squared length of the $i$th
vector, $a_{ii} = v_i'v_i = \|v_i\|^2$, and the $i,j$th off-diagonal element of $A$ is the product
of the lengths of $v_i$ and $v_j$ and the cosine of the angle between them. The
matrix $A$ specifies the lengths and configuration of the vectors.

We shall give a geometric interpretation† of the derivation of the density
of the rectangular coordinates $t_{ij}$, $i \ge j$, when $\Sigma = I$. The probability element
of $t_{11}$ is approximately the probability that $\|v_1\|$ lies in the interval $t_{11} < \|v_1\|$
$< t_{11} + dt_{11}$. This is the probability that $v_1$ falls in a spherical shell in $n$
dimensions with inner radius $t_{11}$ and thickness $dt_{11}$. In this region, the density
$(2\pi)^{-\frac12 n}\exp(-\tfrac12 v_1'v_1)$ is approximately constant, namely, $(2\pi)^{-\frac12 n}\exp(-\tfrac12 t_{11}^2)$.
The surface area of the unit sphere in $n$ dimensions is $C(n) = 2\pi^{\frac12 n}/\Gamma(\tfrac12 n)$
(Problems 7.1-7.3), and the volume of the spherical shell is approximately
$C(n)\,t_{11}^{\,n-1}\,dt_{11}$. The probability element is the product of the volume and
approximate density, namely,

(15)  $$2^{-(\frac12 n - 1)}\,t_{11}^{\,n-1}\exp\left(-\tfrac12 t_{11}^2\right)dt_{11}\big/\Gamma\left(\tfrac12 n\right).$$


The probability element of $t_{i1}, \ldots, t_{i,i-1}, t_{ii}$ given $v_1, \ldots, v_{i-1}$ (i.e., given $w_1, \ldots, w_{i-1}$) is approximately the probability that $v_i$ falls in the region for which $t_{i1} < v_i'w_1/\|w_1\| < t_{i1} + dt_{i1}, \ldots, t_{i,i-1} < v_i'w_{i-1}/\|w_{i-1}\| < t_{i,i-1} + dt_{i,i-1}$, and $t_{ii} < \|w_i\| < t_{ii} + dt_{ii}$, where $w_i$ is the projection of $v_i$ on the $(n - i + 1)$-dimensional space orthogonal to $w_1, \ldots, w_{i-1}$. Each of the first $i - 1$ pairs of inequalities defines the region between two hyperplanes (the different pairs being orthogonal). The last pair of inequalities defines a cylindrical shell whose intersection with the $(i - 1)$-dimensional hyperplane spanned by $v_1, \ldots, v_{i-1}$ is a spherical shell in $n - i + 1$ dimensions with inner radius $t_{ii}$. In this region the density $(2\pi)^{-\frac12 n}\exp(-\tfrac12 v_i'v_i)$ is approximately constant, namely, $(2\pi)^{-\frac12 n}\exp(-\tfrac12\sum_{j=1}^{i} t_{ij}^2)$. The volume of the region is approximately $dt_{i1}\cdots dt_{i,i-1}\, C(n - i + 1)\, t_{ii}^{n-i}\, dt_{ii}$. The probability element is

(16)  $\dfrac{2^{1-\frac12 n}\,\pi^{-\frac12(i-1)}\, t_{ii}^{n-i} \exp\left(-\tfrac12\sum_{j=1}^{i} t_{ij}^2\right)}{\Gamma[\tfrac12(n+1-i)]}\; dt_{i1}\cdots dt_{ii}.$

Then the product of (15) and (16) for $i = 2, \ldots, p$ is (6) times $dt_{11}\cdots dt_{pp}$.
This analysis, which exactly parallels the geometric derivation by Wishart [and later by Mahalanobis, Bose, and Roy (1937)], was given by Sverdrup (1947) [and by Fog (1948) for $p = 3$]. Another method was used by Madow (1938), who drew on the distribution of correlation coefficients (for $\Sigma = I$) obtained by Hotelling by considering certain partial correlation coefficients. Hsu (1939b) gave an inductive proof, and Rasch (1948) gave a method involving the use of a functional equation. A different method is to obtain the characteristic function and invert it, as was done by Ingham (1933) and by Wishart and Bartlett (1933).

†In the first edition of this book, the derivation of the Wishart distribution and its geometric interpretation were in terms of the nonorthogonal vectors $v_1, \ldots, v_p$.

Cramér (1946) verified that the Wishart distribution has the characteristic function of $A$. By means of alternative matrix transformations Elfving (1947), Mauldon (1955), and Olkin and Roy (1954) derived the Wishart distribution via the Bartlett decomposition; Kshirsagar (1959) based his derivation on random orthogonal transformations. Narain (1948), (1950) and Ogawa (1953) used a regression approach. James (1954), Khatri and Ramachandran (1958), and Khatri (1963) applied different methods. Giri (1977) used invariance. Wishart (1948) surveyed the derivations up to that date. Some of these methods are indicated in the problems.

The relation $A = TT'$ is known as the Bartlett decomposition [Bartlett (1939)], and the (nonzero) elements of $T$ were termed rectangular coordinates by Mahalanobis, Bose, and Roy (1937).


Corollary 7.2.4

(17)  $\displaystyle\int_{B>0} |B|^{\,t-\frac12(p+1)}\, e^{-\operatorname{tr} B}\, dB = \pi^{p(p-1)/4}\prod_{i=1}^{p}\Gamma[t - \tfrac12(i-1)].$

Proof. Here $B > 0$ denotes $B$ positive definite. Since (14) is a density, its integral for $A > 0$ is 1. Let $\Sigma = I$, $A = 2B$ ($dA = 2^{\frac12 p(p+1)}\,dB$), and $n = 2t$. Then the fact that the integral is 1 is identical to (17) for $t$ a half integer. However, if we derive (14) from (6), we can let $n$ be any real number greater than $p - 1$. In fact (17) holds for complex $t$ such that $\mathcal{R}t > p - 1$. ($\mathcal{R}t$ means the real part of $t$.) ∎


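Definition 7.2.1. The multivariate gamma function is

(18)  $\Gamma_p(t) = \pi^{p(p-1)/4}\displaystyle\prod_{i=1}^{p}\Gamma[t - \tfrac12(i-1)].$

The Wishart density can be written

(19)  $w(A \mid \Sigma, n) = \dfrac{|A|^{\frac12(n-p-1)}\, e^{-\frac12\operatorname{tr}\Sigma^{-1}A}}{2^{\frac12 pn}\,|\Sigma|^{\frac12 n}\,\Gamma_p(\tfrac12 n)}.$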
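Definition (18) and the normalizing constant of (19) are easy to evaluate on the log scale. The sketch below is illustrative only: the names `log_multivariate_gamma` and `log_wishart_density` are hypothetical, the argument is restricted to real $t > \tfrac12(p-1)$ so that every gamma factor has a positive argument, and the caller is assumed to supply $\log|A|$, $\log|\Sigma|$, and $\operatorname{tr}\Sigma^{-1}A$ computed elsewhere.

```python
import math


def log_multivariate_gamma(t, p):
    """log Gamma_p(t) from (18); assumes real t > (p - 1)/2."""
    if t <= 0.5 * (p - 1):
        raise ValueError("t must exceed (p - 1)/2")
    return 0.25 * p * (p - 1) * math.log(math.pi) + sum(
        math.lgamma(t - 0.5 * (i - 1)) for i in range(1, p + 1)
    )


def log_wishart_density(log_det_a, log_det_sigma, trace_sigma_inv_a, n, p):
    """Logarithm of (19), given log|A|, log|Sigma| and tr(Sigma^{-1} A)."""
    log_numerator = 0.5 * (n - p - 1) * log_det_a - 0.5 * trace_sigma_inv_a
    log_denominator = (
        0.5 * p * n * math.log(2.0)
        + 0.5 * n * log_det_sigma
        + log_multivariate_gamma(0.5 * n, p)
    )
    return log_numerator - log_denominator


# p = 1 check: (19) should reduce to the density of sigma^2 times a chi-square
# variable with n degrees of freedom (illustrative values below).
a, sigma2, n = 3.0, 2.0, 4
wishart_p1 = math.exp(log_wishart_density(math.log(a), math.log(sigma2), a / sigma2, n, 1))
scaled_chi2 = (
    a ** (0.5 * n - 1) * math.exp(-0.5 * a / sigma2)
    / (2.0 ** (0.5 * n) * sigma2 ** (0.5 * n) * math.gamma(0.5 * n))
)
```

Working with logarithms avoids overflow in the gamma functions when $n$ is large; exponentiate only at the end if the density itself is needed.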


7.3. SOME PROPERTIES OF THE WISHART DISTRIBUTION 


7.3.1. The Characteristic Function 


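The characteristic function of the Wishart distribution can be obtained directly from the distribution of the observations. Suppose $Z_1, \ldots, Z_n$ are distributed independently, each with density

(1)  $\dfrac{1}{(2\pi)^{\frac12 p}\,|\Sigma|^{\frac12}}\exp(-\tfrac12 z'\Sigma^{-1}z).$

Let

(2)  $A = \displaystyle\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'.$

Introduce the $p \times p$ matrix $\Theta = (\theta_{ij})$ with $\theta_{ij} = \theta_{ji}$. The characteristic function of $A_{11}, A_{22}, \ldots, A_{pp}, 2A_{12}, 2A_{13}, \ldots, 2A_{p-1,p}$ is

(3)  $\mathcal{E}\exp[i\operatorname{tr}(A\Theta)] = \mathcal{E}\exp\left[i\operatorname{tr}\displaystyle\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'\Theta\right] = \mathcal{E}\exp\left[i\operatorname{tr}\sum_{\alpha=1}^{n} Z_\alpha'\Theta Z_\alpha\right] = \mathcal{E}\exp\left(i\sum_{\alpha=1}^{n} Z_\alpha'\Theta Z_\alpha\right).$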


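It follows from Lemma 2.6.1 that

(4)  $\mathcal{E}\exp\left(i\displaystyle\sum_{\alpha=1}^{n} Z_\alpha'\Theta Z_\alpha\right) = \prod_{\alpha=1}^{n}\mathcal{E}\exp(iZ_\alpha'\Theta Z_\alpha) = [\mathcal{E}\exp(iZ'\Theta Z)]^{n},$

where $Z$ has the density (1). For $\Theta$ real, there is a real nonsingular matrix $B$ such that

(5)  $B'\Sigma^{-1}B = I,$
(6)  $B'\Theta B = D,$

where $D$ is a real diagonal matrix (Theorem A.2.2 of the Appendix). If we let $z = By$, then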


(7)  $\mathcal{E}\exp(iZ'\Theta Z) = \mathcal{E}\exp(iY'DY) = \mathcal{E}\displaystyle\prod_{j=1}^{p}\exp(id_{jj}Y_j^2) = \prod_{j=1}^{p}\mathcal{E}\exp(id_{jj}Y_j^2)$

by Lemma 2.6.2. The $j$th factor in the second product is $\mathcal{E}\exp(id_{jj}Y_j^2)$, where $Y_j$ has the distribution $N(0, 1)$; this is the characteristic function of the $\chi^2$-distribution with one degree of freedom, namely $(1 - 2id_{jj})^{-\frac12}$ [as can be proved by expanding $\exp(id_{jj}y_j^2)$ in a power series and integrating term by term]. Thus

(8)  $\mathcal{E}\exp(iZ'\Theta Z) = \displaystyle\prod_{j=1}^{p}(1 - 2id_{jj})^{-\frac12} = |I - 2iD|^{-\frac12}$


since $I - 2iD$ is a diagonal matrix. From (5) and (6) we see that

(9)  $|I - 2iD| = |B'\Sigma^{-1}B - 2iB'\Theta B| = |B'(\Sigma^{-1} - 2i\Theta)B| = |B'|\cdot|\Sigma^{-1} - 2i\Theta|\cdot|B| = |B|^2\cdot|\Sigma^{-1} - 2i\Theta|,$

$|B'|\cdot|\Sigma^{-1}|\cdot|B| = |I| = 1$, and $|B|^2 = 1/|\Sigma^{-1}|$. Combining the above results, we obtain

(10)  $\mathcal{E}\exp[i\operatorname{tr}(A\Theta)] = \dfrac{|\Sigma^{-1}|^{\frac12 n}}{|\Sigma^{-1} - 2i\Theta|^{\frac12 n}} = |I - 2i\Theta\Sigma|^{-\frac12 n}.$

It can be shown that the result is valid provided $(\mathcal{R}(\sigma^{jk} - 2i\theta_{jk}))$ is positive definite. In particular, it is true for all real $\Theta$. It also holds for $\Sigma$ singular.

Theorem 7.3.1. If $Z_1, \ldots, Z_n$ are independent, each with distribution $N(0, \Sigma)$, then the characteristic function of $A_{11}, \ldots, A_{pp}, 2A_{12}, \ldots, 2A_{p-1,p}$, where $(A_{ij}) = A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, is given by (10).
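The two expressions in (10) can be compared numerically for $p = 2$. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the matrices `sigma` and `theta` are arbitrary illustrative values, and $n$ is taken even so that the power $\tfrac12 n$ is an integer and no branch of a complex root has to be chosen.

```python
def det2(m):
    """Determinant of a 2 x 2 matrix (entries may be complex)."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a nonsingular 2 x 2 matrix."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def matmul2(a, b):
    """Product of two 2 x 2 matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(2)) for c in range(2)] for r in range(2)]


def cf_determinant_form(theta, sigma, n):
    """|I - 2i Theta Sigma|^(-n/2), the right-hand expression in (10); n must be even here."""
    if n % 2:
        raise ValueError("this sketch only handles even n")
    ts = matmul2(theta, sigma)
    m = [[(1.0 if r == c else 0.0) - 2 * 1j * ts[r][c] for c in range(2)] for r in range(2)]
    return det2(m) ** (-(n // 2))


def cf_ratio_form(theta, sigma, n):
    """|Sigma^{-1}|^(n/2) / |Sigma^{-1} - 2i Theta|^(n/2), the middle expression in (10)."""
    if n % 2:
        raise ValueError("this sketch only handles even n")
    si = inv2(sigma)
    m = [[si[r][c] - 2 * 1j * theta[r][c] for c in range(2)] for r in range(2)]
    return (det2(si) / det2(m)) ** (n // 2)


sigma = [[2.0, 0.5], [0.5, 1.0]]      # illustrative positive definite Sigma
theta = [[0.3, -0.1], [-0.1, 0.2]]    # illustrative real symmetric Theta
n = 6
value_determinant_form = cf_determinant_form(theta, sigma, n)
value_ratio_form = cf_ratio_form(theta, sigma, n)

# With diagonal Sigma and Theta the result factors into two chi-square-type terms.
diag_value = cf_determinant_form([[0.3, 0.0], [0.0, 0.2]], [[2.0, 0.0], [0.0, 1.0]], n)
diag_expected = ((1 - 2j * 0.3 * 2.0) * (1 - 2j * 0.2 * 1.0)) ** (-3)
```

At $\Theta = 0$ both forms return 1, as any characteristic function must.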


7.3.2. The Sum of Wishart Matrices 


Suppose the $A_i$, $i = 1, 2$, are distributed independently according to $W(\Sigma, n_i)$, respectively. Then $A_1$ is distributed as $\sum_{\alpha=1}^{n_1} Z_\alpha Z_\alpha'$, and $A_2$ is distributed as $\sum_{\alpha=n_1+1}^{n_1+n_2} Z_\alpha Z_\alpha'$, where $Z_1, \ldots, Z_{n_1+n_2}$ are independent, each with distribution $N(0, \Sigma)$. Then $A = A_1 + A_2$ is distributed as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where $n = n_1 + n_2$. Thus $A$ is distributed according to $W(\Sigma, n)$. Similarly, the sum of $q$ matrices distributed independently, each according to a Wishart distribution with covariance $\Sigma$, has a Wishart distribution with covariance matrix $\Sigma$ and number of degrees of freedom equal to the sum of the numbers of degrees of freedom of the component matrices.

Theorem 7.3.2. If $A_1, \ldots, A_q$ are independently distributed with $A_i$ distributed according to $W(\Sigma, n_i)$, then

(11)  $A = \displaystyle\sum_{i=1}^{q} A_i$

is distributed according to $W(\Sigma, \sum_{i=1}^{q} n_i)$.


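7.3.3. A Certain Linear Transformation

We shall frequently make the transformation

(12)  $A = CBC',$

where $C$ is a nonsingular $p \times p$ matrix. If $A$ is distributed according to $W(\Sigma, n)$, then $B$ is distributed according to $W(\Phi, n)$, where

(13)  $\Phi = C^{-1}\Sigma C'^{-1}.$

This is proved by the following argument: Let $A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where $Z_1, \ldots, Z_n$ are independently distributed, each according to $N(0, \Sigma)$. Then $Y_\alpha = C^{-1}Z_\alpha$ is distributed according to $N(0, \Phi)$. However,

(14)  $B = \displaystyle\sum_{\alpha=1}^{n} Y_\alpha Y_\alpha' = C^{-1}\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'\, C'^{-1} = C^{-1}AC'^{-1}$

is distributed according to $W(\Phi, n)$. Finally, $|\partial(A)/\partial(B)|$, the Jacobian of the transformation (12), is

(15)  $\left|\dfrac{\partial(A)}{\partial(B)}\right| = \dfrac{w(B \mid \Phi, n)}{w(A \mid \Sigma, n)} = \dfrac{|B|^{\frac12(n-p-1)}\,|\Sigma|^{\frac12 n}}{|A|^{\frac12(n-p-1)}\,|\Phi|^{\frac12 n}} = \operatorname{mod}|C|^{p+1}.$

Theorem 7.3.3. The Jacobian of the transformation (12) from $A$ to $B$, where $A$ and $B$ are symmetric, is $\operatorname{mod}|C|^{p+1}$.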
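Theorem 7.3.3 can be verified directly for small $p$ by writing the linear map $B \mapsto CBC'$ as a matrix acting on the $\tfrac12 p(p+1)$ distinct elements $b_{ij}$, $i \le j$, and taking its determinant. The sketch below does this in plain Python; all function names and the matrices `C2`, `C3` are illustrative assumptions, not part of the text.

```python
def matmul(a, b):
    """Matrix product of lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def det(a):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in a]
    n, d = len(a), 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[piv][c] == 0.0:
            return 0.0
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def congruence_map_matrix(c):
    """Matrix of B -> C B C' in the coordinates b_ij, i <= j, of a symmetric B."""
    p = len(c)
    idx = [(i, j) for i in range(p) for j in range(i, p)]
    columns = []
    for (i, j) in idx:
        basis = [[0.0] * p for _ in range(p)]
        basis[i][j] = basis[j][i] = 1.0          # symmetric unit perturbation of b_ij
        image = matmul(matmul(c, basis), transpose(c))
        columns.append([image[r][s] for (r, s) in idx])
    return transpose(columns)


C2 = [[2.0, 1.0], [1.0, 3.0]]                       # |C2| = 5, p = 2
C3 = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [1.0, 0.0, 1.0]]  # |C3| = 7, p = 3
jacobian_p2 = det(congruence_map_matrix(C2))
jacobian_p3 = det(congruence_map_matrix(C3))
```

For `C2` the determinant of the $3 \times 3$ map is $5^3 = 125$, and for `C3` the determinant of the $6 \times 6$ map is $7^4 = 2401$, in agreement with $\operatorname{mod}|C|^{p+1}$.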


7.3.4. Marginal Distributions

If $A$ is distributed according to $W(\Sigma, n)$, the marginal distribution of any arbitrary set of the elements of $A$ may be awkward to obtain. However, the marginal distribution of some sets of elements can be found easily. We give some of these in the following two theorems.

Theorem 7.3.4. Let $A$ and $\Sigma$ be partitioned into $q$ and $p - q$ rows and columns,

(16)  $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$

If $A$ is distributed according to $W(\Sigma, n)$, then $A_{11}$ is distributed according to $W(\Sigma_{11}, n)$.

Proof. $A$ is distributed as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where the $Z_\alpha$ are independent, each with the distribution $N(0, \Sigma)$. Partition $Z_\alpha$ into subvectors of $q$ and $p - q$ components, $Z_\alpha = (Z_\alpha^{(1)\prime}, Z_\alpha^{(2)\prime})'$. Then $Z_1^{(1)}, \ldots, Z_n^{(1)}$ are independent, each with the distribution $N(0, \Sigma_{11})$, and $A_{11}$ is distributed as $\sum_{\alpha=1}^{n} Z_\alpha^{(1)} Z_\alpha^{(1)\prime}$, which has the distribution $W(\Sigma_{11}, n)$. ∎


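Theorem 7.3.5. Let $A$ and $\Sigma$ be partitioned into $p_1, p_2, \ldots, p_q$ rows and columns ($p_1 + \cdots + p_q = p$),

(17)  $A = \begin{pmatrix} A_{11} & \cdots & A_{1q} \\ \vdots & & \vdots \\ A_{q1} & \cdots & A_{qq} \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \cdots & \Sigma_{1q} \\ \vdots & & \vdots \\ \Sigma_{q1} & \cdots & \Sigma_{qq} \end{pmatrix}.$

If $\Sigma_{ij} = 0$ for $i \ne j$ and if $A$ is distributed according to $W(\Sigma, n)$, then $A_{11}, A_{22}, \ldots, A_{qq}$ are independently distributed and $A_{jj}$ is distributed according to $W(\Sigma_{jj}, n)$.

Proof. $A$ is distributed as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where $Z_1, \ldots, Z_n$ are independently distributed, each according to $N(0, \Sigma)$. Let $Z_\alpha$ be partitioned

(18)  $Z_\alpha = \begin{pmatrix} Z_\alpha^{(1)} \\ \vdots \\ Z_\alpha^{(q)} \end{pmatrix}$

as $A$ and $\Sigma$ have been partitioned. Since $\Sigma_{ij} = 0$, the sets $Z_1^{(1)}, \ldots, Z_n^{(1)}, \ldots, Z_1^{(q)}, \ldots, Z_n^{(q)}$ are independent. Then $A_{11} = \sum_{\alpha=1}^{n} Z_\alpha^{(1)} Z_\alpha^{(1)\prime}, \ldots, A_{qq} = \sum_{\alpha=1}^{n} Z_\alpha^{(q)} Z_\alpha^{(q)\prime}$ are independent. The rest of Theorem 7.3.5 follows from Theorem 7.3.4. ∎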




7.3.5. Conditional Distributions 


In Section 4.3 we considered estimation of the parameters of the conditional distribution of $X^{(1)}$ given $X^{(2)} = x^{(2)}$. Application of Theorem 7.2.2 to Theorem 4.3.3 yields the following theorem:

Theorem 7.3.6. Let $A$ and $\Sigma$ be partitioned into $q$ and $p - q$ rows and columns as in (16). If $A$ is distributed according to $W(\Sigma, n)$, the distribution of $A_{11\cdot 2} = A_{11} - A_{12}A_{22}^{-1}A_{21}$ is $W(\Sigma_{11\cdot 2}, n - p + q)$, $n \ge p - q$.

Note that Theorem 7.3.6 implies that $A_{11\cdot 2}$ is independent of $A_{22}$ and $A_{12}A_{22}^{-1}$ regardless of $\Sigma$.

7.4. COCHRAN'S THEOREM

Cochran's theorem [Cochran (1934)] is useful in proving that certain vector quadratic forms are distributed as sums of vector squares. It is a statistical statement of an algebraic theorem, which we shall give as a lemma.


Lemma 7.4.1. If the $N \times N$ symmetric matrix $C_i$ has rank $r_i$, $i = 1, \ldots, m$, and

(1)  $\displaystyle\sum_{i=1}^{m} C_i = I_N,$

then

(2)  $\displaystyle\sum_{i=1}^{m} r_i = N$

is a necessary and sufficient condition for there to exist an $N \times N$ orthogonal matrix $P$ such that for $i = 1, \ldots, m$

(3)  $PC_iP' = \begin{pmatrix} 0 & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & 0 \end{pmatrix},$

where $I$ is of order $r_i$, the upper left-hand $0$ is square of order $\sum_{j=1}^{i-1} r_j$ (which is vacuous for $i = 1$), and the lower right-hand $0$ is square of order $\sum_{j=i+1}^{m} r_j$ (which is vacuous for $i = m$).


Proof. The necessity follows from the fact that (1) implies that the sum of (3) over $i = 1, \ldots, m$ is $I_N$. Now let us prove the sufficiency; we assume (2). There exists an orthogonal matrix $P_i$ such that $P_iC_iP_i'$ is diagonal with diagonal elements the characteristic roots of $C_i$. The number of nonzero roots is $r_i$, the rank of $C_i$, and the number of 0 roots is $N - r_i$. We write

(4)  $P_iC_iP_i' = \begin{pmatrix} 0 & 0 & 0 \\ 0 & \Delta_i & 0 \\ 0 & 0 & 0 \end{pmatrix},$

where the partitioning is according to (3), and $\Delta_i$ is diagonal of order $r_i$. This is possible in view of (2). Then

(5)  $P_i\Bigl(\displaystyle\sum_{j \ne i} C_j\Bigr)P_i' = P_i(I - C_i)P_i' = \begin{pmatrix} I & 0 & 0 \\ 0 & I - \Delta_i & 0 \\ 0 & 0 & I \end{pmatrix}.$

Since the rank of (5) is not greater than $\sum_{j=1}^{m} r_j - r_i = N - r_i$, which is the sum of the orders of the upper left-hand and lower right-hand $I$'s in (5), the rank of $I - \Delta_i$ is 0 and $\Delta_i = I$. (Thus the $r_i$ nonzero roots of $C_i$ are 1, and $C_i$ is positive semidefinite.) From (4) we obtain

(6)  $C_i = P_i'\begin{pmatrix} 0 & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & 0 \end{pmatrix}P_i = B_iB_i',$

where $B_i$ consists of the $r_i$ columns of $P_i'$ corresponding to $I$ in (6). From (1) we obtain

(7)  $I = \displaystyle\sum_{i=1}^{m} B_iB_i' = (B_1, B_2, \ldots, B_m)\begin{pmatrix} B_1' \\ B_2' \\ \vdots \\ B_m' \end{pmatrix} = P'P,$

where $P = (B_1, B_2, \ldots, B_m)'$. ∎


We now state a multivariate analog to Cochran's theorem.

Theorem 7.4.1. Suppose $Y_1, \ldots, Y_N$ are independently distributed, each according to $N(0, \Sigma)$. Suppose the matrix $(c^{i}_{\alpha\beta}) = C_i$ used in forming

(8)  $Q_i = \displaystyle\sum_{\alpha,\beta=1}^{N} c^{i}_{\alpha\beta}\, Y_\alpha Y_\beta', \qquad i = 1, \ldots, m,$

is of rank $r_i$, and suppose

(9)  $\displaystyle\sum_{i=1}^{m} Q_i = \sum_{\alpha=1}^{N} Y_\alpha Y_\alpha'.$

Then (2) is a necessary and sufficient condition for $Q_1, \ldots, Q_m$ to be independently distributed with $Q_i$ having the distribution $W(\Sigma, r_i)$.

It follows from (3) that $C_i$ is idempotent. See Section A.2 of the Appendix.

This theorem is useful in generalizing results from the univariate analysis of variance. (See Chapter 8.) As an example of the use of this theorem, let us prove that the mean of a sample of size $N$ times its transpose and a multiple of the sample covariance matrix are independently distributed with a singular and a nonsingular Wishart distribution, respectively. Let $Y_1, \ldots, Y_N$ be independently distributed, each according to $N(0, \Sigma)$. We shall use the matrices $C_1 = (c^{1}_{\alpha\beta}) = (1/N)$ and $C_2 = (c^{2}_{\alpha\beta}) = [\delta_{\alpha\beta} - (1/N)]$. Then

(10)  $Q_1 = \displaystyle\sum_{\alpha,\beta=1}^{N}\frac{1}{N}\, Y_\alpha Y_\beta' = N\bar Y\bar Y',$

(11)  $Q_2 = \displaystyle\sum_{\alpha,\beta=1}^{N}\Bigl(\delta_{\alpha\beta} - \frac{1}{N}\Bigr)Y_\alpha Y_\beta' = \sum_{\alpha=1}^{N} Y_\alpha Y_\alpha' - N\bar Y\bar Y' = \sum_{\alpha=1}^{N}(Y_\alpha - \bar Y)(Y_\alpha - \bar Y)',$

and (9) is satisfied. The matrix $C_1$ is of rank 1; the matrix $C_2$ is of rank $N - 1$ (since the rank of the sum of two matrices is less than or equal to the sum of the ranks of the matrices and the rank of the second matrix is less than $N$). The conditions of the theorem are satisfied; therefore $Q_1$ is distributed as $ZZ'$, where $Z$ is distributed according to $N(0, \Sigma)$, and $Q_2$ is distributed independently according to $W(\Sigma, N - 1)$.
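The algebraic facts used in this example (the two matrices are idempotent, mutually orthogonal, sum to the identity, and have ranks 1 and $N - 1$) can be confirmed with exact rational arithmetic. The sketch below is illustrative: it takes $N = 4$ and scalar observations ($p = 1$) purely for brevity, and the helper names and the data vector `y` are assumptions made for the example.

```python
from fractions import Fraction


def matmul(a, b):
    """Exact matrix product of lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def trace(m):
    return sum(m[i][i] for i in range(len(m)))


def quadratic_form(c, y):
    """Sum over alpha, beta of c[alpha][beta] * y[alpha] * y[beta] (scalar case p = 1)."""
    return sum(c[a][b] * y[a] * y[b] for a in range(len(y)) for b in range(len(y)))


N = 4
C1 = [[Fraction(1, N) for _ in range(N)] for _ in range(N)]
C2 = [[Fraction(int(a == b)) - Fraction(1, N) for b in range(N)] for a in range(N)]
identity = [[Fraction(int(a == b)) for b in range(N)] for a in range(N)]
zero = [[Fraction(0) for _ in range(N)] for _ in range(N)]

# For an idempotent matrix the rank equals the trace.
rank_C1 = trace(C1)
rank_C2 = trace(C2)

y = [Fraction(v) for v in (1, 2, 4, 7)]   # illustrative scalar observations
ybar = sum(y) / N
Q1 = quadratic_form(C1, y)                # should equal N * ybar^2
Q2 = quadratic_form(C2, y)                # should equal the sum of squared deviations
```

With these data `Q1` is 49 and `Q2` is 21, and their sum is the raw sum of squares 70, which is (9) in the scalar case.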

Anderson and Styan (1982) have given a survey of proofs and extensions of Cochran's theorem.


7.5. THE GENERALIZED VARIANCE 


7.5.1. Definition of the Generalized Variance 


One multivariate analog of the variance $\sigma^2$ of a univariate distribution is the covariance matrix $\Sigma$. Another multivariate analog is the scalar $|\Sigma|$, which is called the generalized variance of the multivariate distribution [Wilks (1932); see also Frisch (1929)]. Similarly, the generalized variance of the sample of vectors $x_1, \ldots, x_N$ is

(1)  $|S| = \left|\dfrac{1}{N-1}\displaystyle\sum_{\alpha=1}^{N}(x_\alpha - \bar x)(x_\alpha - \bar x)'\right|.$

In some sense each of these is a measure of spread. We consider them here because the sample generalized variance will recur in many likelihood ratio criteria for testing hypotheses.

A geometric interpretation of the sample generalized variance comes from considering the $p$ rows of $X = (x_1, \ldots, x_N)$ as $p$ vectors in $N$-dimensional space. In Section 3.2 it was shown that the rows of

(2)  $(x_1 - \bar x, \ldots, x_N - \bar x) = X - \bar x\varepsilon',$

where $\varepsilon = (1, \ldots, 1)'$, are orthogonal to the equiangular line (through the origin and $\varepsilon$); see Figure 3.2. Then the entries of

(3)  $A = (X - \bar x\varepsilon')(X - \bar x\varepsilon')'$

are the inner products of rows of $X - \bar x\varepsilon'$.

We now define a parallelotope determined by $p$ vectors $v_1, \ldots, v_p$ in an $n$-dimensional space ($n \ge p$). If $p = 1$, the parallelotope is the line segment $v_1$. If $p = 2$, the parallelotope is the parallelogram with $v_1$ and $v_2$ as principal edges; that is, its sides are $v_1$, $v_2$, $v_1$ translated so its initial endpoint is at $v_2$, and $v_2$ translated so its initial endpoint is at $v_1$. See Figure 7.2. If $p = 3$, the parallelotope is the conventional parallelepiped with $v_1$, $v_2$, and $v_3$ as principal edges. In general, the parallelotope is the figure defined by the principal edges $v_1, \ldots, v_p$. It is cut out by $p$ pairs of parallel $(p - 1)$-dimensional hyperplanes, one hyperplane of a pair being spanned by $p - 1$ of $v_1, \ldots, v_p$ and the other hyperplane going through the endpoint of the remaining vector.

Figure 7.2. A parallelogram.


Theorem 7.5.1. If $V = (v_1, \ldots, v_p)$, then the square of the $p$-dimensional volume of the parallelotope with $v_1, \ldots, v_p$ as principal edges is $|V'V|$.

Proof. If $p = 1$, then $|V'V| = v_1'v_1 = \|v_1\|^2$, which is the square of the one-dimensional volume of $v_1$. If two $k$-dimensional parallelotopes have bases consisting of $(k - 1)$-dimensional parallelotopes of equal $(k - 1)$-dimensional volumes and equal altitudes, their $k$-dimensional volumes are equal [since the $k$-dimensional volume is the integral of the $(k - 1)$-dimensional volumes]. In particular, the volume of a $k$-dimensional parallelotope is equal to the volume of a parallelotope with the same base (in $k - 1$ dimensions) and same altitude with sides in the $k$th direction orthogonal to the first $k - 1$ directions. Thus the volume of the parallelotope with principal edges $v_1, \ldots, v_k$, say $P_k$, is equal to the volume of the parallelotope with principal edges $v_1, \ldots, v_{k-1}$, say $P_{k-1}$, times the altitude of $P_k$ over $P_{k-1}$; that is,

(4)  $\operatorname{Vol}(P_k) = \operatorname{Vol}(P_{k-1}) \times \operatorname{Alt}(P_k \mid P_{k-1}).$

It follows (by induction) that

(5)  $\operatorname{Vol}(P_p) = \operatorname{Vol}(P_1) \times \operatorname{Alt}(P_2 \mid P_1) \times \cdots \times \operatorname{Alt}(P_p \mid P_{p-1}).$

By the construction in Section 7.2 the altitude of $P_k$ over $P_{k-1}$ is $t_{kk} = \|w_k\|$; that is, $t_{kk}$ is the distance of $v_k$ from the $(k - 1)$-dimensional space spanned by $v_1, \ldots, v_{k-1}$ (or $w_1, \ldots, w_{k-1}$). Hence $\operatorname{Vol}(P_p) = \prod_{k=1}^{p} t_{kk}$. Since $|V'V| = |TT'| = \prod_{i=1}^{p} t_{ii}^2$, the theorem is proved. ∎


We now apply this theorem to the parallelotope having the rows of (2) as principal edges. The dimensionality in Theorem 7.5.1 is arbitrary (but at least $p$).

Corollary 7.5.1. The square of the $p$-dimensional volume of the parallelotope with the rows of (2) as principal edges is $|A|$, where $A$ is given by (3).

We shall see later that many multivariate statistics can be given an interpretation in terms of these volumes. These volumes are analogous to distances that arise in special cases when $p = 1$.


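We now consider a geometric interpretation of $|A|$ in terms of $N$ points in $p$-space. Let the columns of the matrix (2) be $y_1, \ldots, y_N$, representing $N$ points in $p$-space. When $p = 1$, $|A| = \sum_\alpha y_{1\alpha}^2$, which is the sum of squares of the distances from the points to the origin. In general $|A|$ is the sum of squares of the volumes of all parallelotopes formed by taking as principal edges $p$ vectors from the set $y_1, \ldots, y_N$.

We see that

(6)  $|A| = \begin{vmatrix} \sum_\alpha y_{1\alpha}^2 & \cdots & \sum_\alpha y_{1\alpha}y_{p-1,\alpha} & \sum_\beta y_{1\beta}y_{p\beta} \\ \sum_\alpha y_{2\alpha}y_{1\alpha} & \cdots & \sum_\alpha y_{2\alpha}y_{p-1,\alpha} & \sum_\beta y_{2\beta}y_{p\beta} \\ \vdots & & \vdots & \vdots \\ \sum_\alpha y_{p\alpha}y_{1\alpha} & \cdots & \sum_\alpha y_{p\alpha}y_{p-1,\alpha} & \sum_\beta y_{p\beta}^2 \end{vmatrix} = \displaystyle\sum_{\beta=1}^{N}\begin{vmatrix} \sum_\alpha y_{1\alpha}^2 & \cdots & \sum_\alpha y_{1\alpha}y_{p-1,\alpha} & y_{1\beta}y_{p\beta} \\ \sum_\alpha y_{2\alpha}y_{1\alpha} & \cdots & \sum_\alpha y_{2\alpha}y_{p-1,\alpha} & y_{2\beta}y_{p\beta} \\ \vdots & & \vdots & \vdots \\ \sum_\alpha y_{p\alpha}y_{1\alpha} & \cdots & \sum_\alpha y_{p\alpha}y_{p-1,\alpha} & y_{p\beta}^2 \end{vmatrix}$

by the rule for expanding determinants. [See (24) of Section A.1 of the Appendix.] In (6) the matrix $A$ has been partitioned into $p - 1$ and 1 columns. Applying the rule successively to the columns, we find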


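(7)  $|A| = \displaystyle\sum_{\alpha_1, \ldots, \alpha_p = 1}^{N} \bigl|\, y_{i\alpha_j}\, y_{j\alpha_j} \bigr|.$

By Theorem 7.5.1 the square of the volume of the parallelotope with $y_{\gamma_1}, \ldots, y_{\gamma_p}$, $\gamma_1 < \cdots < \gamma_p$, as principal edges is

(8)  $V^2_{\gamma_1, \ldots, \gamma_p} = \Bigl|\displaystyle\sum_\beta y_{i\beta}\, y_{j\beta}\Bigr|,$

where the sum on $\beta$ is over $(\gamma_1, \ldots, \gamma_p)$. If we now expand this determinant in the manner used for $|A|$, we obtain

(9)  $V^2_{\gamma_1, \ldots, \gamma_p} = \displaystyle\sum \bigl|\, y_{i\beta_j}\, y_{j\beta_j} \bigr|,$

where the sum is for each $\beta_j$ over the range $(\gamma_1, \ldots, \gamma_p)$. Summing (9) over all different sets $(\gamma_1 < \cdots < \gamma_p)$, we obtain (7). ($|y_{i\beta_j}\, y_{j\beta_j}| = 0$ if two or more $\beta_j$ are equal.) Thus $|A|$ is the sum of volumes squared of all different parallelotopes formed by sets of $p$ of the vectors $y_\alpha$ as principal edges. If we replace $y_\alpha$ by $x_\alpha - \bar x$, we can state the following theorem: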


Theorem 7.5.2. Let $|S|$ be defined by (1), where $x_1, \ldots, x_N$ are the $N$ vectors of a sample. Then $|S|$ is proportional to the sum of squares of the volumes of all the different parallelotopes formed by using as principal edges $p$ vectors with $p$ of $x_1, \ldots, x_N$ as one set of endpoints and $\bar x$ as the other, and the factor of proportionality is $1/(N - 1)^p$.
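Theorem 7.5.2 can be illustrated with a small exact computation: the determinant of $A$ equals the sum, over all choices of $p$ of the $N$ centered points, of the squared determinant of the $p \times p$ matrix formed from those points. The sketch below uses integer arithmetic; the function names and the centered data matrix `Y` (with $p = 2$, $N = 4$) are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations


def det(m):
    """Exact determinant by Laplace expansion along the first row (fine for small p)."""
    if len(m) == 1:
        return m[0][0]
    return sum(
        (-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
        for j in range(len(m))
    )


def gram(y):
    """A = Y Y' for a p x N matrix Y given as a list of p rows."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in y] for ri in y]


def sum_squared_volumes(y):
    """Sum over all sets of p columns of the squared p x p determinant."""
    p, n_points = len(y), len(y[0])
    total = 0
    for cols in combinations(range(n_points), p):
        sub = [[row[c] for c in cols] for row in y]
        total += det(sub) ** 2
    return total


# Columns are the centered observations y_alpha = x_alpha - xbar; each row sums to zero.
Y = [[1, -2, 0, 1],
     [2, 1, -3, 0]]
p, N = len(Y), len(Y[0])
det_A = det(gram(Y))
total_volume_squared = sum_squared_volumes(Y)
det_S = Fraction(det_A, (N - 1) ** p)   # |S| = |A| / (N - 1)^p
```

Here both `det_A` and `total_volume_squared` equal 84 (the six squared $2 \times 2$ minors are 25, 9, 4, 36, 1, 9), so $|S| = 84/9$.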


The population analog of $|S|$ is $|\Sigma|$, which can also be given a geometric interpretation. From Section 3.3 we know that

(10)  $\Pr\{X'\Sigma^{-1}X \le \chi_p^2(\alpha)\} = 1 - \alpha$

if $X$ is distributed according to $N(0, \Sigma)$; that is, the probability is $1 - \alpha$ that $X$ fall inside the ellipsoid

(11)  $x'\Sigma^{-1}x = \chi_p^2(\alpha).$

The volume of this ellipsoid is $C(p)\,|\Sigma|^{\frac12}[\chi_p^2(\alpha)]^{\frac12 p}/p$, where $C(p)$ is defined in Problem 7.3.


7.5.2. Distribution of the Sample Generalized Variance 


The distribution of $|S|$ is the same as the distribution of $|A|/(N - 1)^p$, where $A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$ and $Z_1, \ldots, Z_n$ are distributed independently, each according to $N(0, \Sigma)$, and $n = N - 1$. Let $Z_\alpha = CY_\alpha$, $\alpha = 1, \ldots, n$, where $CC' = \Sigma$. Then $Y_1, \ldots, Y_n$ are independently distributed, each with distribution $N(0, I)$. Let

(12)  $B = \displaystyle\sum_{\alpha=1}^{n} Y_\alpha Y_\alpha' = \sum_{\alpha=1}^{n} C^{-1}Z_\alpha Z_\alpha'(C^{-1})' = C^{-1}A(C^{-1})';$

then $|A| = |C|\cdot|B|\cdot|C'| = |B|\cdot|\Sigma|$. By the development in Section 7.2 we see that $|B|$ has the distribution of $\prod_{i=1}^{p} t_{ii}^2$ and that $t_{11}^2, \ldots, t_{pp}^2$ are independently distributed with $\chi^2$-distributions.

Theorem 7.5.3. The distribution of the generalized variance $|S|$ of a sample $X_1, \ldots, X_N$ from $N(\mu, \Sigma)$ is the same as the distribution of $|\Sigma|/(N - 1)^p$ times the product of $p$ independent factors, the distribution of the $i$th factor being the $\chi^2$-distribution with $N - i$ degrees of freedom.


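If $p = 1$, $|S|$ has the distribution of $|\Sigma|\cdot\chi^2_{N-1}/(N - 1)$. If $p = 2$, $|S|$ has the distribution of $|\Sigma|\,\chi^2_{N-1}\cdot\chi^2_{N-2}/(N - 1)^2$. It follows from Problem 7.15 or 7.37 that when $p = 2$, $|S|$ has the distribution of $|\Sigma|(\chi^2_{2N-4})^2/(2N - 2)^2$. We can write

(13)  $|A| = |\Sigma| \times \chi^2_{N-1} \times \chi^2_{N-2} \times \cdots \times \chi^2_{N-p}.$

If $p = 2r$, then $|A|$ is distributed as

(14)  $|\Sigma|\,(\chi^2_{2N-4})^2 \times (\chi^2_{2N-8})^2 \times \cdots \times (\chi^2_{2N-4r})^2 / 2^{2r}.$

Since the $h$th moment of a $\chi^2$-variable with $m$ degrees of freedom is $2^h\Gamma(\tfrac12 m + h)/\Gamma(\tfrac12 m)$ and the moment of a product of independent variables is the product of the moments of the variables, the $h$th moment of $|A|$ is

(15)  $|\Sigma|^h\displaystyle\prod_{i=1}^{p}\left\{2^h\,\frac{\Gamma[\tfrac12(N-i)+h]}{\Gamma[\tfrac12(N-i)]}\right\} = 2^{hp}|\Sigma|^h\,\frac{\prod_{i=1}^{p}\Gamma[\tfrac12(N-i)+h]}{\prod_{i=1}^{p}\Gamma[\tfrac12(N-i)]} = 2^{hp}|\Sigma|^h\,\frac{\Gamma_p[\tfrac12(N-1)+h]}{\Gamma_p[\tfrac12(N-1)]}.$

Thus

(16)  $\mathcal{E}|A| = |\Sigma|\displaystyle\prod_{i=1}^{p}(N - i),$

(17)  $\mathcal{V}(|A|) = |\Sigma|^2\displaystyle\prod_{i=1}^{p}(N - i)\left[\prod_{j=1}^{p}(N - j + 2) - \prod_{j=1}^{p}(N - j)\right],$

where $\mathcal{V}(|A|)$ is the variance of $|A|$.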
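The closed forms (16) and (17) should agree with the general moment formula (15) at $h = 1$ and $h = 2$. The following sketch checks this numerically; the function names are hypothetical, and the values $|\Sigma| = 2$, $N = 10$, $p = 3$ are arbitrary illustrative inputs.

```python
import math


def moment_det_A(h, det_sigma, N, p):
    """h-th moment of |A| from (15): 2^(hp) |Sigma|^h prod_i Gamma((N-i)/2 + h) / Gamma((N-i)/2)."""
    log_ratio = sum(
        math.lgamma(0.5 * (N - i) + h) - math.lgamma(0.5 * (N - i))
        for i in range(1, p + 1)
    )
    return 2.0 ** (h * p) * det_sigma ** h * math.exp(log_ratio)


def mean_det_A(det_sigma, N, p):
    """Expected value of |A| from (16)."""
    return det_sigma * math.prod(N - i for i in range(1, p + 1))


def var_det_A(det_sigma, N, p):
    """Variance of |A| from (17)."""
    prod_lower = math.prod(N - i for i in range(1, p + 1))
    prod_upper = math.prod(N - j + 2 for j in range(1, p + 1))
    return det_sigma ** 2 * prod_lower * (prod_upper - prod_lower)


det_sigma, N, p = 2.0, 10, 3          # illustrative values
mean_value = mean_det_A(det_sigma, N, p)
variance_value = var_det_A(det_sigma, N, p)
first_moment = moment_det_A(1, det_sigma, N, p)
second_moment = moment_det_A(2, det_sigma, N, p)
```

With these inputs the mean is $2 \cdot 9 \cdot 8 \cdot 7 = 1008$ and the variance is $979{,}776$, which matches the second moment minus the squared mean. Dividing the mean by $(N - 1)^p$ and the variance by $(N - 1)^{2p}$ gives the corresponding quantities for $|S|$.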


7.5.3. The Asymptotic Distribution of the Sample Generalized Variance 

Let $|B|/n^p = V_1(n) \times V_2(n) \times \cdots \times V_p(n)$, where the $V$'s are independently distributed and $nV_i(n) = \chi^2_{n-p+i}$. Since $\chi^2_{n-p+i}$ is distributed as $\sum_{\alpha=1}^{n-p+i} W_\alpha^2$, where the $W_\alpha$ are independent, each with distribution $N(0, 1)$, the central limit theorem (applied to $W_\alpha^2$) states that

(18)  $\dfrac{nV_i(n) - (n - p + i)}{\sqrt{2(n - p + i)}} = \dfrac{\sqrt{n}\,\bigl[V_i(n) - 1 + (p - i)/n\bigr]}{\sqrt{2}\,\sqrt{1 - (p - i)/n}}$

is asymptotically distributed according to $N(0, 1)$. Then $\sqrt{n}[V_i(n) - 1]$ is asymptotically distributed according to $N(0, 2)$. We now apply Theorem 4.2.3. We have

(19)  $U(n) = \begin{pmatrix} V_1(n) \\ \vdots \\ V_p(n) \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix},$

$|B|/n^p = w = f(u_1, \ldots, u_p) = u_1u_2\cdots u_p$, $T = 2I$, $\partial f/\partial u_i|_{u=b} = 1$, and $\phi_b'T\phi_b = 2p$. Thus

(20)  $\sqrt{n}\left(\dfrac{|B|}{n^p} - 1\right)$

is asymptotically distributed according to $N(0, 2p)$.

Theorem 7.5.4. Let $S$ be a $p \times p$ sample covariance matrix with $n$ degrees of freedom. Then $\sqrt{n}(|S|/|\Sigma| - 1)$ is asymptotically normally distributed with mean 0 and variance $2p$.


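7.6. DISTRIBUTION OF THE SET OF CORRELATION COEFFICIENTS WHEN THE POPULATION COVARIANCE MATRIX IS DIAGONAL

In Section 4.2.1 we found the distribution of a single sample correlation when the corresponding population correlation was zero. Here we shall find the density of the set $r_{ij}$, $i < j$, $i, j = 1, \ldots, p$, when $\rho_{ij} = 0$, $i < j$.

We start with the distribution of $A$ when $\Sigma$ is diagonal; that is, $W[(\sigma_{ii}\delta_{ij}), n]$. The density of $A$ is

(1)  $\dfrac{|a_{ij}|^{\frac12(n-p-1)}\exp\left(-\tfrac12\sum_{i=1}^{p} a_{ii}/\sigma_{ii}\right)}{2^{\frac12 np}\prod_{i=1}^{p}\sigma_{ii}^{\frac12 n}\,\Gamma_p(\tfrac12 n)},$

since

(2)  $|\Sigma| = \begin{vmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{vmatrix} = \displaystyle\prod_{i=1}^{p}\sigma_{ii}.$

We make the transformation

(3)  $a_{ij} = \sqrt{a_{ii}}\sqrt{a_{jj}}\; r_{ij}, \qquad i \ne j,$
(4)  $a_{ii} = a_{ii}.$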

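The Jacobian is the product of the Jacobian of (4) and that of (3) for $a_{ii}$ fixed. The Jacobian of (3) is the determinant of a $p(p - 1)/2$-order diagonal matrix with diagonal elements $\sqrt{a_{ii}}\sqrt{a_{jj}}$. Since each particular subscript $k$, say, appears in the set $r_{ij}$ ($i < j$) $p - 1$ times, the Jacobian is

(5)  $J = \displaystyle\prod_{i=1}^{p} a_{ii}^{\frac12(p-1)}.$

If we substitute from (3) and (4) into $w[A \mid (\sigma_{ii}\delta_{ij}), n]$ and multiply by (5), we obtain as the joint density of $\{a_{ii}\}$ and $\{r_{ij}\}$

(6)  $\dfrac{\bigl|\sqrt{a_{ii}}\sqrt{a_{jj}}\, r_{ij}\bigr|^{\frac12(n-p-1)}\exp\left(-\tfrac12\sum_{i=1}^{p} a_{ii}/\sigma_{ii}\right)}{2^{\frac12 np}\prod_{i=1}^{p}\sigma_{ii}^{\frac12 n}\,\Gamma_p(\tfrac12 n)}\displaystyle\prod_{i=1}^{p} a_{ii}^{\frac12(p-1)} = \frac{|r_{ij}|^{\frac12(n-p-1)}}{\Gamma_p(\tfrac12 n)}\prod_{i=1}^{p}\left\{\frac{a_{ii}^{\frac12 n-1}\exp(-\tfrac12 a_{ii}/\sigma_{ii})}{2^{\frac12 n}\sigma_{ii}^{\frac12 n}}\right\},$

since

(7)  $\bigl|\sqrt{a_{ii}}\sqrt{a_{jj}}\, r_{ij}\bigr| = \displaystyle\prod_{i=1}^{p} a_{ii}\,|r_{ij}|,$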


where $r_{ii} = 1$. In the $i$th term of the product on the right-hand side of (6), let $a_{ii}/(2\sigma_{ii}) = u_i$; then the integral of this term is

(8)  $\displaystyle\int_0^\infty \frac{a_{ii}^{\frac12 n-1}\exp(-\tfrac12 a_{ii}/\sigma_{ii})}{2^{\frac12 n}\sigma_{ii}^{\frac12 n}}\,da_{ii} = \int_0^\infty u_i^{\frac12 n-1}e^{-u_i}\,du_i = \Gamma(\tfrac12 n)$

by definition of the gamma function (or by the fact that $a_{ii}/\sigma_{ii}$ has the $\chi^2$-density with $n$ degrees of freedom). Hence the density of $r_{ij}$ is

(9)  $\dfrac{\Gamma^p(\tfrac12 n)\,|r_{ij}|^{\frac12(n-p-1)}}{\Gamma_p(\tfrac12 n)}.$

Theorem 7.6.1. If $X_1, \ldots, X_N$ are independent, each with distribution $N[\mu, (\sigma_{ii}\delta_{ij})]$, then the density of the sample correlation coefficients is given by (9) where $n = N - 1$.
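For $p = 2$ the determinant $|r_{ij}|$ is $1 - r^2$, so (9) reduces to a density in the single correlation $r$ on $(-1, 1)$, and one can check numerically that it integrates to 1. The sketch below makes that check; the function names, the midpoint rule, and the choices $n = 5$ and $n = 10$ are illustrative assumptions.

```python
import math


def log_multivariate_gamma(t, p):
    """log Gamma_p(t) = (p(p-1)/4) log(pi) + sum_i log Gamma(t - (i-1)/2)."""
    return 0.25 * p * (p - 1) * math.log(math.pi) + sum(
        math.lgamma(t - 0.5 * (i - 1)) for i in range(1, p + 1)
    )


def correlation_density_p2(r, n):
    """Density (9) specialized to p = 2: Gamma(n/2)^2 (1 - r^2)^((n-3)/2) / Gamma_2(n/2)."""
    log_constant = 2.0 * math.lgamma(0.5 * n) - log_multivariate_gamma(0.5 * n, 2)
    return math.exp(log_constant) * (1.0 - r * r) ** (0.5 * (n - 3))


def midpoint_integral(f, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))


total_n5 = midpoint_integral(lambda r: correlation_density_p2(r, 5), -1.0, 1.0)
total_n10 = midpoint_integral(lambda r: correlation_density_p2(r, 10), -1.0, 1.0)
density_at_zero_n5 = correlation_density_p2(0.0, 5)
```

For $n = 5$ the density is $\tfrac34(1 - r^2)$, so its value at $r = 0$ is $0.75$; this agrees with the null density of a single correlation coefficient referred to at the start of the section.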




7.7. THE INVERTED WISHART DISTRIBUTION AND BAYES 
ESTIMATION OF THE COVARIANCE MATRIX 


7.7.1. The Inverted Wishart Distribution 


As indicated in Section 3.4.2, Bayes estimators are usually admissible. The calculation of Bayes estimators is facilitated when the prior distribution of the parameter is chosen conveniently. When there is a sufficient statistic, there will exist a family of prior distributions for the parameter such that the posterior distribution is a member of this family; such a family is called a conjugate family of distributions. In Section 3.4.2 we saw that the normal family of priors is conjugate to the normal family of distributions when the covariance matrix is given. In this section we shall consider Bayesian estimation of the covariance matrix and estimation of the mean vector and the covariance matrix.


Theorem 7.7.1. If $A$ has the distribution $W(\Sigma, m)$, then $B = A^{-1}$ has the density

(1)  $\dfrac{|\Psi|^{\frac12 m}\,|B|^{-\frac12(m+p+1)}\,e^{-\frac12\operatorname{tr}\Psi B^{-1}}}{2^{\frac12 mp}\,\Gamma_p(\tfrac12 m)}$

for $B$ positive definite and 0 elsewhere, where $\Psi = \Sigma^{-1}$.

Proof. By Theorem A.4.6 of the Appendix, the Jacobian of the transformation $A = B^{-1}$ is $|B|^{-(p+1)}$. Substitution of $B^{-1}$ for $A$ in (16) of Section 7.2 and multiplication by $|B|^{-(p+1)}$ yields (1). ∎

We shall call (1) the density of the inverted Wishart distribution with $m$ degrees of freedom† and denote the distribution by $W^{-1}(\Psi, m)$ and the density by $w^{-1}(B \mid \Psi, m)$. We shall call $\Psi$ the precision matrix or concentration matrix.


7.7.2. Bayes Estimation of the Covariance Matrix 


The covariance matrix of a sample of size $N$ from $N(\mu, \Sigma)$ has the distribution of $(1/n)A$, where $A$ has the distribution $W(\Sigma, n)$ and $n = N - 1$. We shall now show that if $\Sigma$ is assigned an inverted Wishart distribution, then the conditional distribution of $\Sigma$ given $A$ is an inverted Wishart distribution. In other words, the family of inverted Wishart distributions for $\Sigma$ is conjugate to the family of Wishart distributions.

†The definition of the number of degrees of freedom differs from that of Giri (1977), p. 104, and Muirhead (1982), p. 113.


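Theorem 7.7.2. If $A$ has the distribution $W(\Sigma, n)$ and $\Sigma$ has the a priori distribution $W^{-1}(\Psi, m)$, then the conditional distribution of $\Sigma$ is $W^{-1}(A + \Psi, n + m)$.

Proof. The joint density of $A$ and $\Sigma$ is

(2)  $\dfrac{|\Psi|^{\frac12 m}\,|\Sigma|^{-\frac12(n+m+p+1)}\,|A|^{\frac12(n-p-1)}\,e^{-\frac12\operatorname{tr}(A+\Psi)\Sigma^{-1}}}{2^{\frac12(n+m)p}\,\Gamma_p(\tfrac12 n)\,\Gamma_p(\tfrac12 m)}$

for $A$ and $\Sigma$ positive definite. The marginal density of $A$ is the integral of (2) over the set of $\Sigma$ positive definite. Since the integral of (1) with respect to $B$ is 1 identically in $\Psi$, the integral of (2) with respect to $\Sigma$ is

(3)  $\dfrac{\Gamma_p[\tfrac12(n+m)]\,|\Psi|^{\frac12 m}\,|A|^{\frac12(n-p-1)}\,|A+\Psi|^{-\frac12(n+m)}}{\Gamma_p(\tfrac12 n)\,\Gamma_p(\tfrac12 m)}$

for $A$ positive definite. The conditional density of $\Sigma$ given $A$ is the ratio of (2) to (3), namely,

(4)  $\dfrac{|A+\Psi|^{\frac12(n+m)}\,|\Sigma|^{-\frac12(n+m+p+1)}\,e^{-\frac12\operatorname{tr}(A+\Psi)\Sigma^{-1}}}{2^{\frac12(n+m)p}\,\Gamma_p[\tfrac12(n+m)]},$

which is $w^{-1}(\Sigma \mid A + \Psi, n + m)$. ∎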

Corollary 7.7.1. If $nS$ has the distribution $W(\Sigma, n)$ and $\Sigma$ has the a priori distribution $W^{-1}(\Psi, m)$, then the conditional distribution of $\Sigma$ given $S$ is $W^{-1}(nS + \Psi, n + m)$.

Corollary 7.7.2. If $nS$ has the distribution $W(\Sigma, n)$, $\Sigma$ has the a priori distribution $W^{-1}(\Psi, m)$, and the loss function is $\operatorname{tr}(D - \Sigma)G(D - \Sigma)H$, where $G$ and $H$ are positive definite, then the Bayes estimator for $\Sigma$ is

(5)  $\dfrac{1}{n + m - p - 1}(nS + \Psi).$

Proof. It follows from Section 3.4.2 that the Bayes estimator for $\Sigma$ is $\mathcal{E}(\Sigma \mid S)$. From Theorem 7.7.2 we see that $\Sigma^{-1}$ has the a posteriori distribution $W[(nS + \Psi)^{-1}, n + m]$. The theorem results from the following lemma. ∎

Lemma 7.7.1. If $A$ has the distribution $W(\Sigma, n)$, then

(6)  $\mathcal{E}A^{-1} = \dfrac{1}{n - p - 1}\,\Sigma^{-1}.$

Proof. If $C$ is a nonsingular matrix such that $\Sigma = CC'$, then $A$ has the distribution of $CBC'$, where $B$ has the distribution $W(I, n)$, and $\mathcal{E}A^{-1} = (C')^{-1}(\mathcal{E}B^{-1})C^{-1}$. By symmetry the diagonal elements of $\mathcal{E}B^{-1}$ are the same and the off-diagonal elements are the same; that is, $\mathcal{E}B^{-1} = k_1I + k_2\varepsilon\varepsilon'$. For every orthogonal matrix $Q$, $QBQ'$ has the distribution $W(I, n)$ and hence $\mathcal{E}(QBQ')^{-1} = Q\,\mathcal{E}B^{-1}Q' = \mathcal{E}B^{-1}$. Thus $k_2 = 0$. A diagonal element of $B^{-1}$ has the distribution of $(\chi^2_{n-p+1})^{-1}$. (See, e.g., the proof of Theorem 5.2.2.) Since $\mathcal{E}(\chi^2_{n-p+1})^{-1} = (n - p - 1)^{-1}$, $\mathcal{E}B^{-1} = (n - p - 1)^{-1}I$. Then (6) follows. ∎

We note that $(n - p - 1)A^{-1} = [(n - p - 1)/n]S^{-1}$ (since $A = nS$) is an unbiased estimator of the precision $\Sigma^{-1}$.

If $\mu$ is known, the unbiased estimator of $\Sigma$ is $(1/N)\sum_{\alpha=1}^{N}(x_\alpha - \mu)(x_\alpha - \mu)'$. The above can be applied with $n$ replaced by $N$. Note that if $n$ (or $N$) is large, (5) is approximately $S$.


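Theorem 7.7.3. Let $x_1, \ldots, x_N$ be observations from $N(\mu, \Sigma)$. Suppose $\mu$ and $\Sigma$ have the a priori density $n(\mu \mid \nu, (1/K)\Sigma) \times w^{-1}(\Sigma \mid \Psi, m)$. Then the a posteriori density of $\mu$ and $\Sigma$ given $\bar x = (1/N)\sum_{\alpha=1}^{N} x_\alpha$ and $S = (1/n)\sum_{\alpha=1}^{N}(x_\alpha - \bar x)(x_\alpha - \bar x)'$ is

(7)  $n\left(\mu \,\Big|\, \dfrac{1}{N+K}(N\bar x + K\nu),\ \dfrac{1}{N+K}\Sigma\right) \cdot w^{-1}\left(\Sigma \,\Big|\, \Psi + nS + \dfrac{NK}{N+K}(\bar x - \nu)(\bar x - \nu)',\ N + m\right).$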

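Proof. Since $\bar x$ and $nS = A$ are a sufficient set of statistics, we can consider the joint density of $\bar x$, $A$, $\mu$, and $\Sigma$, which is

(8)  $\dfrac{K^{\frac12 p}N^{\frac12 p}\,|\Psi|^{\frac12 m}\,|\Sigma|^{-\frac12(N+m+p+2)}\,|A|^{\frac12(N-p-2)}}{2^{\frac12(N+m+1)p}\,\pi^{p}\,\Gamma_p[\tfrac12(N-1)]\,\Gamma_p(\tfrac12 m)} \cdot \exp\Bigl\{-\tfrac12\bigl[N(\bar x - \mu)'\Sigma^{-1}(\bar x - \mu) + \operatorname{tr}A\Sigma^{-1} + K(\mu - \nu)'\Sigma^{-1}(\mu - \nu) + \operatorname{tr}\Psi\Sigma^{-1}\bigr]\Bigr\}.$

The marginal density of $\bar x$ and $A$ is the integral of (8) with respect to $\mu$ and $\Sigma$. The exponential in (8) is $-\tfrac12$ times

(9)  $(N+K)\mu'\Sigma^{-1}\mu - 2(N\bar x + K\nu)'\Sigma^{-1}\mu + N\bar x'\Sigma^{-1}\bar x + K\nu'\Sigma^{-1}\nu + \operatorname{tr}(A + \Psi)\Sigma^{-1}$
$\qquad = (N+K)\Bigl[\mu - \dfrac{1}{N+K}(N\bar x + K\nu)\Bigr]'\Sigma^{-1}\Bigl[\mu - \dfrac{1}{N+K}(N\bar x + K\nu)\Bigr] + \dfrac{NK}{N+K}(\bar x - \nu)'\Sigma^{-1}(\bar x - \nu) + \operatorname{tr}(A + \Psi)\Sigma^{-1}.$

The integral of (8) with respect to $\mu$ is

(10)  $\dfrac{K^{\frac12 p}N^{\frac12 p}\,|\Psi|^{\frac12 m}\,|\Sigma|^{-\frac12(N+m+p+1)}\,|A|^{\frac12(N-p-2)}}{(N+K)^{\frac12 p}\,2^{\frac12(N+m)p}\,\pi^{\frac12 p}\,\Gamma_p[\tfrac12(N-1)]\,\Gamma_p(\tfrac12 m)} \cdot \exp\Bigl\{-\tfrac12\Bigl[\operatorname{tr}A\Sigma^{-1} + \dfrac{NK}{N+K}(\bar x - \nu)'\Sigma^{-1}(\bar x - \nu) + \operatorname{tr}\Psi\Sigma^{-1}\Bigr]\Bigr\}.$

In turn, the integral of (10) with respect to $\Sigma$ is

(11)  $\dfrac{K^{\frac12 p}N^{\frac12 p}\,\Gamma_p[\tfrac12(N+m)]\,|\Psi|^{\frac12 m}\,|A|^{\frac12(N-p-2)}}{\pi^{\frac12 p}\,(N+K)^{\frac12 p}\,\Gamma_p[\tfrac12(N-1)]\,\Gamma_p(\tfrac12 m)} \cdot \Bigl|\Psi + A + \dfrac{NK}{N+K}(\bar x - \nu)(\bar x - \nu)'\Bigr|^{-\frac12(N+m)}.$

The conditional density of $\mu$ and $\Sigma$ given $\bar x$ and $A$ is the ratio of (8) to (11), namely,

(12)  $\dfrac{(N+K)^{\frac12 p}\,|\Sigma|^{-\frac12(N+m+p+2)}\,\bigl|\Psi + A + \frac{NK}{N+K}(\bar x - \nu)(\bar x - \nu)'\bigr|^{\frac12(N+m)}}{2^{\frac12(N+m+1)p}\,\pi^{\frac12 p}\,\Gamma_p[\tfrac12(N+m)]}$
$\qquad \cdot \exp\Bigl\{-\tfrac12\Bigl[(N+K)\Bigl[\mu - \dfrac{1}{N+K}(N\bar x + K\nu)\Bigr]'\Sigma^{-1}\Bigl[\mu - \dfrac{1}{N+K}(N\bar x + K\nu)\Bigr] + \operatorname{tr}\Bigl[\Psi + A + \dfrac{NK}{N+K}(\bar x - \nu)(\bar x - \nu)'\Bigr]\Sigma^{-1}\Bigr]\Bigr\}.$

Then (12) can be written as (7). ∎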


Corollary 7.7.3. If $x_1, \ldots, x_N$ are observations from $N(\mu, \Sigma)$, if $\mu$ and $\Sigma$ have the a priori density $n[\mu \mid \nu, (1/K)\Sigma] \times w^{-1}(\Sigma \mid \Psi, m)$, and if the loss function is $(d - \mu)'J(d - \mu) + \operatorname{tr}(D - \Sigma)G(D - \Sigma)H$, then the Bayes estimators of $\mu$ and $\Sigma$ are

(13)  $\dfrac{1}{N+K}(N\bar x + K\nu)$

and

(14)  $\dfrac{1}{N + m - p - 1}\Bigl[nS + \Psi + \dfrac{NK}{N+K}(\bar x - \nu)(\bar x - \nu)'\Bigr],$

respectively.

The estimator of $\mu$ is a weighted average of the sample mean $\bar x$ and the a priori mean $\nu$. If $N$ is large, the a priori mean has relatively little weight.

The estimator of $\Sigma$ is a weighted average of the sample covariance $S$, $\Psi$, and a term deriving from the difference between the sample mean and the a priori mean. If $N$ is large, the estimator is close to the sample covariance matrix.
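The Bayes estimators (13) and (14) are simple enough to compute directly. The sketch below is a plain-Python illustration under stated assumptions: the function name `bayes_estimators` and its argument names are hypothetical, matrices are passed as lists of lists, and the numerical inputs are arbitrary values chosen so that the arithmetic can be followed by hand.

```python
def bayes_estimators(xbar, S, N, nu, K, Psi, m):
    """Return the estimators (13) and (14) for mu and Sigma.

    xbar, nu : lists of length p (sample mean and a priori mean)
    S, Psi   : p x p lists of lists (sample covariance with n = N - 1, prior matrix)
    K, m     : prior weight for the mean and prior degrees of freedom
    """
    p = len(xbar)
    n = N - 1
    denom = N + m - p - 1
    if denom <= 0:
        raise ValueError("N + m - p - 1 must be positive")
    # (13): weighted average of the sample mean and the a priori mean.
    mu_hat = [(N * xbar[i] + K * nu[i]) / (N + K) for i in range(p)]
    # (14): nS + Psi + (NK / (N + K)) (xbar - nu)(xbar - nu)', divided by N + m - p - 1.
    weight = N * K / (N + K)
    diff = [xbar[i] - nu[i] for i in range(p)]
    sigma_hat = [
        [(n * S[i][j] + Psi[i][j] + weight * diff[i] * diff[j]) / denom for j in range(p)]
        for i in range(p)
    ]
    return mu_hat, sigma_hat


# Illustrative inputs with p = 2.
mu_hat, sigma_hat = bayes_estimators(
    xbar=[1.0, 2.0], S=[[2.0, 0.5], [0.5, 1.0]], N=6,
    nu=[0.0, 0.0], K=2.0, Psi=[[1.0, 0.0], [0.0, 1.0]], m=5,
)
# Scalar case p = 1 for a hand check.
mu_hat_1, sigma_hat_1 = bayes_estimators([2.0], [[3.0]], 4, [0.0], 4.0, [[1.0]], 5)
```

In the scalar check, the estimator of the mean is $(4 \cdot 2 + 4 \cdot 0)/8 = 1$ and the estimator of the variance is $(9 + 1 + 2 \cdot 4)/7 = 18/7$, which shows how the prior pulls the mean toward $\nu$ while the disagreement between $\bar x$ and $\nu$ inflates the variance estimate.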


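Theorem 7.7.4. If $x_1, \ldots, x_N$ are observations from $N(\mu, \Sigma)$ and if $\mu$ and $\Sigma$ have the a priori density $n[\mu \mid \nu, (1/K)\Sigma] \times w^{-1}(\Sigma \mid \Psi, m)$, then the marginal a posteriori density of $\mu$ given $\bar x$ and $S$ is

(15)  $\dfrac{(N+K)^{\frac12 p}\,\Gamma[\tfrac12(N+m+1)]\,|B|^{-\frac12}}{\pi^{\frac12 p}\,\Gamma[\tfrac12(N+m+1-p)]\,\bigl[1 + (N+K)(\mu - \mu^*)'B^{-1}(\mu - \mu^*)\bigr]^{\frac12(N+m+1)}},$

where $\mu^*$ is (13) and $B$ is $N + m - p - 1$ times (14).

Proof. The exponent in (12) is $-\tfrac12$ times

(16)  $\operatorname{tr}\bigl[B + (N+K)(\mu - \mu^*)(\mu - \mu^*)'\bigr]\Sigma^{-1}.$

Then the integral of (12) with respect to $\Sigma$ is

$\dfrac{(N+K)^{\frac12 p}\,\Gamma_p[\tfrac12(N+m+1)]\,|B|^{\frac12(N+m)}}{\pi^{\frac12 p}\,\Gamma_p[\tfrac12(N+m)]\,\bigl|B + (N+K)(\mu - \mu^*)(\mu - \mu^*)'\bigr|^{\frac12(N+m+1)}}.$

Since $|B + xx'| = |B|(1 + x'B^{-1}x)$ (Corollary 4.3.1), (15) follows. ∎

The density (15) is the multivariate $t$-distribution with $N + m + 1 - p$ degrees of freedom. See Section 2.7.5, Examples.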


7.8. IMPROVED ESTIMATION OF THE COVARIANCE MATRIX 


Just as the sample mean $\bar{x}$ can be improved on as an estimator of the
population mean $\mu$ when the loss function is quadratic, so can the sample
covariance $S$ be improved on as an estimator of the population covariance $\Sigma$
for certain loss functions. The loss function for estimation of the location
parameter $\mu$ was invariant with respect to translation ($x \to x + a$, $\mu \to \mu + a$),
and the risk of the sample mean (which is the unique unbiased function of
the sufficient statistic when $\Sigma$ is known) does not depend on the parameter
value. The natural group of transformations of covariance matrices is
multiplication on the left by a nonsingular matrix and on the right by its transpose
($x \to Cx$, $S \to CSC'$, $\Sigma \to C\Sigma C'$). We consider two loss functions which are
invariant with respect to such transformations.

One loss function is quadratic:


(1)   $L_q(\Sigma, G) = \operatorname{tr}\,(G - \Sigma)\Sigma^{-1}(G - \Sigma)\Sigma^{-1}$
$= \operatorname{tr}\,(G\Sigma^{-1} - I)^2$,


where G is a positive definite matrix. The other is based on the form of the 
likelihood function: 


(2)   $L_\ell(\Sigma, G) = \operatorname{tr} G\Sigma^{-1} - \log|G\Sigma^{-1}| - p$.


(See Lemma 3.2.2 and alternative proofs in Problems 3.4, 3.8, and 3.12.) Each
of these is 0 when $G = \Sigma$ and is positive when $G \ne \Sigma$. The second loss
function approaches $\infty$ as $G$ approaches a singular matrix or when one or
more elements (or one or more characteristic roots) of $G$ approaches $\infty$. (See
proof of Lemma 3.2.2.) Each is invariant with respect to transformations
$G^* = CGC'$, $\Sigma^* = C\Sigma C'$. We can see some properties of the loss functions
from $L_q(I, D) = \sum_{i=1}^p (d_{ii} - 1)^2$ and $L_\ell(I, D) = \sum_{i=1}^p (d_{ii} - \log d_{ii} - 1)$, where
$D$ is diagonal. (By Theorem A.2.2 of the Appendix for arbitrary positive
definite $\Sigma$ and symmetric $G$, there exists a nonsingular $C$ such that $C\Sigma C' = I$
and $CGC' = D$.) If we let $g = (g_{11}, \ldots, g_{pp}, g_{12}, \ldots, g_{p-1,p})'$,
$s = (s_{11}, \ldots, s_{pp}, s_{12}, \ldots, s_{p-1,p})'$, $\sigma = (\sigma_{11}, \ldots, \sigma_{pp}, \sigma_{12}, \ldots, \sigma_{p-1,p})'$, and
$\Phi = \mathcal{E}(s - \sigma)(s - \sigma)'$, then $L_q(\Sigma, G)$ is a constant multiple of $(g - \sigma)'\Phi^{-1}(g - \sigma)$.
(See Problem 7.33.)

The maximum likelihood estimator $\hat\Sigma$ and the unbiased estimator $S$ are of
the form $aA$, where $A$ has the distribution $W(\Sigma, n)$ and $n = N - 1$.


Theorem 7.8.1. The quadratic risk of $aA$ is minimized at $a = 1/(n + p + 1)$,
and its value is $p(p + 1)/(n + p + 1)$. The likelihood risk of $aA$ is minimized at
$a = 1/n$ (i.e., $aA = S$), and its value is $p \log n - \sum_{i=1}^p \mathcal{E} \log \chi^2_{n+1-i}$.

Proof. By the invariance of the loss function,

(3)   $\mathcal{E}_\Sigma L_q(\Sigma, aA) = \mathcal{E}_I L_q(I, aA^*)$
$= \mathcal{E}_I \operatorname{tr}\,(aA^* - I)^2$
$= \mathcal{E}_I \left[ a^2 \sum_{i,j=1}^p a_{ij}^{*2} - 2a \sum_{i=1}^p a_{ii}^* + p \right]$
$= a^2[(2n + n^2)p + np(p - 1)] - 2anp + p$
$= p[n(n + p + 1)a^2 - 2na + 1]$,

which has its minimum at $a = 1/(n + p + 1)$. Similarly,

(4)   $\mathcal{E}_\Sigma L_\ell(\Sigma, aA) = \mathcal{E}_I L_\ell(I, aA^*)$
$= \mathcal{E}_I \{ a \operatorname{tr} A^* - \log|A^*| - p \log a - p \}$
$= p[na - \log a - 1] - \mathcal{E}_I \log|A^*|$,

which is minimized at $a = 1/n$. ∎
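The closed form in (3) is easy to evaluate exactly. The following is a minimal sketch, not part of the original text, that compares the quadratic risk of three multiples of $A$; the function names `quadratic_risk` and `best_scalar_quadratic` and the values of `n` and `p` are illustrative assumptions.

```python
from fractions import Fraction

def quadratic_risk(a, n, p):
    """Quadratic risk of aA from (3): p[n(n+p+1)a^2 - 2na + 1]."""
    return p * (n * (n + p + 1) * a * a - 2 * n * a + 1)

def best_scalar_quadratic(n, p):
    """Minimizing multiple a = 1/(n+p+1) and its risk p(p+1)/(n+p+1)."""
    a = Fraction(1, n + p + 1)
    return a, quadratic_risk(a, n, p)

n, p = 10, 3                                        # illustrative values
a_opt, r_opt = best_scalar_quadratic(n, p)
r_S = quadratic_risk(Fraction(1, n), n, p)          # S = A/n
r_mle = quadratic_risk(Fraction(1, n + 1), n, p)    # maximum likelihood estimator A/N
```

Under these assumptions the risk of $S$ works out to $p(p+1)/n$, so both $S$ and the maximum likelihood estimator are dominated by $A/(n+p+1)$ for the quadratic loss.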


Although the minimum risk of the estimator of the form aA is constant for 
its loss function, the estimator is not minimax. We shall now consider 
estimators G(A) such that 


(5) G( HAH’) = HG(A)H’ 


for lower triangular matrices $H$. The two loss functions are invariant with
respect to transformations $G^* = HGH'$, $\Sigma^* = H\Sigma H'$.

Let $A = I$ and $H$ be the diagonal matrix $D_i$ with $-1$ as the $i$th diagonal
element and 1 as each other diagonal element. Then $HAH' = I$, and the
$i,j$th component of (5) is

(6)   $g_{ij}(I) = -g_{ij}(I)$,   $j \ne i$.

Hence, $g_{ij}(I) = 0$, $i \ne j$, and $G(I)$ is diagonal, say $D$. Since $A = TT'$ for $T$
lower triangular, we have

(7)   $G(A) = G(TIT')$
$= TG(I)T'$
$= TDT'$,

where $D$ is a diagonal matrix not depending on $A$. We note in passing that if
(5) holds for all nonsingular $H$, then $D = aI$ for some $a$. ($H$ can be taken as
a permutation matrix.)
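The representation (7) can be checked numerically. The sketch below is an illustration rather than the author's procedure: it builds $G(A) = TDT'$ from a Cholesky factor and verifies property (5) for one lower triangular $H$. The helper name `triangular_equivariant`, the random test matrices, and the weights `d` are assumptions made for the example.

```python
import numpy as np

def triangular_equivariant(A, d):
    """G(A) = T D T' with A = T T', T lower triangular, D = diag(d)."""
    T = np.linalg.cholesky(A)              # lower triangular, positive diagonal
    return T @ np.diag(d) @ T.T

rng = np.random.default_rng(0)
p = 3
M = rng.standard_normal((p + 2, p))
A = M.T @ M + np.eye(p)                    # positive definite test matrix
H = np.tril(rng.standard_normal((p, p)))
H[np.diag_indices(p)] = [1.5, -0.7, 2.0]   # nonsingular lower triangular H
d = np.array([0.2, 0.3, 0.5])              # arbitrary diagonal of D

lhs = triangular_equivariant(H @ A @ H.T, d)      # G(HAH')
rhs = H @ triangular_equivariant(A, d) @ H.T      # H G(A) H'
```

A negative diagonal entry in $H$ only changes the sign of a column of the Cholesky factor, which cancels in $TDT'$, so `lhs` and `rhs` should agree up to rounding.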

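If $\Sigma = KK'$, where $K$ is lower triangular, then

(8)   $\mathcal{E}_\Sigma L[\Sigma, G(A)] = \int L[\Sigma, G(A)]\, C(p,n)\, |\Sigma|^{-\frac12 n} |A|^{\frac12(n-p-1)} e^{-\frac12 \operatorname{tr} \Sigma^{-1}A}\, dA$
$= \int L[KK', G(A)]\, C(p,n)\, |KK'|^{-\frac12 n} |A|^{\frac12(n-p-1)} e^{-\frac12 \operatorname{tr} K^{-1}A(K')^{-1}}\, dA$
$= \int L[KK', G(KA^*K')]\, C(p,n)\, |A^*|^{\frac12(n-p-1)} e^{-\frac12 \operatorname{tr} A^*}\, dA^*$
$= \mathcal{E}_I L[KK', KG(A^*)K']$
$= \mathcal{E}_I L[I, G(A^*)]$

by invariance of the loss function. The risk does not depend on $\Sigma$.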
For the quadratic loss function we calculate 


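(9)   $\mathcal{E}_I L_q[I, G(A)] = \mathcal{E}_I L_q[I, TDT']$
$= \mathcal{E}_I \operatorname{tr}\,(TDT' - I)^2$
$= \mathcal{E}_I \operatorname{tr}\,(TDT'TDT' - 2TDT' + I)$
$= \mathcal{E}_I \sum_{i,j,k,l=1}^p t_{ij} d_j t_{kj} t_{kl} d_l t_{il} - 2\,\mathcal{E}_I \sum_{i,j=1}^p t_{ij}^2 d_j + p$.

The expectations can be evaluated by using the fact that the (nonzero)
elements of $T$ are independent, $t_{ii}^2$ has the $\chi^2$-distribution with $n + 1 - i$
degrees of freedom, and $t_{ij}$, $i > j$, has the distribution $N(0, 1)$. Then

(10)   $\mathcal{E}_I L_q[I, G(A)] = d'Fd - 2f'd + p$,

where $F = (f_{ij})$, $f = (f_i)$,

(11)   $f_{ii} = (n + p - 2i + 1)(n + p - 2i + 3)$,
$f_{ij} = n + p - 2j + 1$,   $i < j$,
$f_i = n + p - 2i + 1$,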


and $d = (d_1, \ldots, d_p)'$. Since $d'Fd = \mathcal{E} \operatorname{tr}\,(TDT')^2 > 0$, $F$ is positive definite
and (10) has a unique minimum. It is attained at $d = F^{-1}f$, and the minimum
is $p - f'F^{-1}f$.
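As a hedged illustration of (10) and (11), the following sketch assembles $F$ and $f$ and solves for $d = F^{-1}f$. It assumes $F$ is symmetric with $f_{ij} = n + p - 2\max(i,j) + 1$ off the diagonal, which is how (11) is read here; the function name `quadratic_weights` is an assumption, not from the text.

```python
import numpy as np

def quadratic_weights(n, p):
    """Return F, f of (11) and d = F^{-1} f (1-based i, j as in the text)."""
    idx = np.arange(1, p + 1)
    f = (n + p - 2 * idx + 1).astype(float)
    F = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            if i == j:
                F[i, j] = f[i] * (f[i] + 2)    # (n+p-2i+1)(n+p-2i+3)
            else:
                F[i, j] = f[max(i, j)]         # n+p-2j+1 for i<j, by symmetry
    d = np.linalg.solve(F, f)
    return F, f, d

n, p = 2, 2                                     # illustrative values
F, f, d = quadratic_weights(n, p)
risk_tdt = p - f @ d                            # minimum risk p - f'F^{-1}f
risk_aA = p * (p + 1) / (n + p + 1)             # best multiple of A (Theorem 7.8.1)
```

For $n = p = 2$ this gives $d = (2/11, 3/11)'$ and a risk of $13/11$, which is $1/55$ below the risk $6/5$ of the best multiple of $A$.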


Theorem 7.8.2. With respect to the quadratic loss function the best estimator
invariant with respect to linear transformations $\Sigma \to H\Sigma H'$, $A \to HAH'$, where
$H$ is lower triangular, is $G(A) = TDT'$, where $D$ is the diagonal matrix whose
diagonal elements compose $d = F^{-1}f$, $F$ and $f$ are defined by (11), and $A = TT'$
with $T$ lower triangular.




Since $d = F^{-1}f$ is not proportional to $\varepsilon = (1, \ldots, 1)'$, that is, $F\varepsilon$ is not
proportional to $f$ (see Problem 7.28), this estimator has a smaller (quadratic)
loss than any estimator of the form $aA$ (which is the only type of estimator
invariant under the full linear group). Kiefer (1957) showed that if an
estimator is minimax in the class of estimators invariant with respect to a
group of transformations satisfying certain conditions,† then it is minimax
with respect to all estimators. In this problem the group of triangular linear
transformations satisfies the conditions, while the group of all linear
transformations does not.

The definition of this estimator depends on the coordinate system and on
the numbering of the coordinates. These properties are intuitively unappealing.


Theorem 7.8.3. The estimator G(A) defined in Theorem 7.8.2 is minimax 
with respect to the quadratic loss function. 


In the case of $p = 2$

(12)   $d_1 = \dfrac{(n+1)^2 - (n-1)}{(n+1)^2(n+3) - (n-1)}$,   $d_2 = \dfrac{(n+1)(n+2)}{(n+1)^2(n+3) - (n-1)}$.

The risk is

(13)   $\dfrac{2(3n^2 + 5n + 4)}{n^3 + 5n^2 + 6n + 4}$.

The difference between the risks of the best estimator $aA$ and the best
estimator $TDT'$ is

(14)   $\dfrac{6}{n+3} - \dfrac{6n^2 + 10n + 8}{n^3 + 5n^2 + 6n + 4} = \dfrac{2n(n-1)}{(n+3)(n^3 + 5n^2 + 6n + 4)}$.

The difference is $\frac{1}{55}$ for $n = 2$ (relative to $\frac65$), and $\frac{1}{47}$ for $n = 3$ (relative to 1);
it is of the order $2/n^2$; the improvement due to using the estimator $TDT'$ is
not great, at least for $p = 2$.

For the likelihood loss function we calculate 


(15)   $\mathcal{E}_I L_\ell[I, G(A)] = \mathcal{E}_I L_\ell[I, TDT']$
$= \mathcal{E}_I [\operatorname{tr} TDT' - \log|TDT'| - p]$
$= \mathcal{E}_I \left[ \sum_{i,j=1}^p t_{ij}^2 d_j - \sum_{j=1}^p \log t_{jj}^2 - \sum_{j=1}^p \log d_j - p \right]$
$= \sum_{j=1}^p (n + p - 2j + 1)\,d_j - \sum_{j=1}^p \log d_j - \sum_{j=1}^p \mathcal{E} \log \chi^2_{n+1-j} - p$.

The minimum of (15) occurs at $d_j = 1/(n + p - 2j + 1)$, $j = 1, \ldots, p$.

† The essential condition is that the group is solvable. See Kiefer (1966) and Kudo (1955).


Theorem 7.8.4. With respect to the likelihood loss function, the best estimator
invariant with respect to linear transformations $\Sigma \to H\Sigma H'$, $A \to HAH'$,
where $H$ is lower triangular, is $G(A) = TDT'$, where the $j$th diagonal element of
the diagonal matrix $D$ is $1/(n + p - 2j + 1)$, $j = 1, \ldots, p$, and $A = TT'$, with $T$
lower triangular. The minimum risk is

(16)   $\mathcal{E}_\Sigma L_\ell[\Sigma, G(A)] = \sum_{j=1}^p \log(n + p - 2j + 1) - \sum_{j=1}^p \mathcal{E} \log \chi^2_{n+1-j}$.


Theorem 7.8.5. The estimator $G(A)$ defined in Theorem 7.8.4 is minimax
with respect to the likelihood loss function.


James and Stein (1961) gave this estimator. Note that the reciprocals of
the weights $1/(n + p - 1), 1/(n + p - 3), \ldots, 1/(n - p + 1)$ are symmetrically
distributed about the reciprocal of $1/n$.
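A minimal sketch of the estimator of Theorem 7.8.4 follows, assuming NumPy's Cholesky factor for $T$. The function name `james_stein_cov` and the test matrix are illustrative. For $p = 2$ the result is compared with the decomposition $d_1 A + (d_2 - d_1)\operatorname{diag}(0, |A|/a_{11})$, which follows from writing $TDT'$ as a weighted sum of the outer products of the columns of $T$.

```python
import numpy as np

def james_stein_cov(A, n):
    """G(A) = T D T' with d_j = 1/(n + p - 2j + 1), A = T T', T lower triangular."""
    p = A.shape[0]
    T = np.linalg.cholesky(A)
    j = np.arange(1, p + 1)
    d = 1.0 / (n + p - 2 * j + 1)
    return T @ np.diag(d) @ T.T

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # illustrative 2 x 2 matrix A
n = 5
G = james_stein_cov(A, n)

# p = 2 decomposition with d_1 = 1/(n+1), d_2 = 1/(n-1)
corner = np.diag([0.0, np.linalg.det(A) / A[0, 0]])
closed = A / (n + 1) + (1 / (n - 1) - 1 / (n + 1)) * corner
```

The estimator shrinks the first coordinate more than the second, which is why it depends on the numbering of the coordinates.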


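If $p = 2$,

(17)   $G(A) = \dfrac{1}{n+1}\,A + \dfrac{2}{(n+1)(n-1)} \begin{pmatrix} 0 & 0 \\ 0 & |A|/a_{11} \end{pmatrix}$,

(18)   $\mathcal{E}\,G(A) = \dfrac{n}{n+1}\,\Sigma + \dfrac{2}{n+1} \begin{pmatrix} 0 & 0 \\ 0 & |\Sigma|/\sigma_{11} \end{pmatrix}$.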

The difference between the risks of the best estimator $aA$ and the best
estimator $TDT'$ is

(19)   $p \log n - \sum_{j=1}^p \log(n + p - 2j + 1) = -\sum_{j=1}^p \log\left(1 + \dfrac{p - 2j + 1}{n}\right)$.



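If $p = 2$, the improvement is

(20)   $-\log\left(1 + \dfrac{1}{n}\right) - \log\left(1 - \dfrac{1}{n}\right) = -\log\left(1 - \dfrac{1}{n^2}\right) = \dfrac{1}{n^2} + \dfrac{1}{2n^4} + \dfrac{1}{3n^6} + \cdots$,

which is 0.288 for $n = 2$, 0.118 for $n = 3$, 0.065 for $n = 4$, etc. The risk (19) is
$O(1/n^2)$ for any $p$. (See Problem 7.31.)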
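The numerical values quoted for $p = 2$ can be reproduced from the right-hand side of (19). This is a small sketch with an assumed function name, `likelihood_risk_gain`; it requires $n > p - 1$ so that every logarithm is defined.

```python
import math

def likelihood_risk_gain(n, p):
    """Right-hand side of (19): -sum_j log(1 + (p - 2j + 1)/n), for n > p - 1."""
    return -sum(math.log(1 + (p - 2 * j + 1) / n) for j in range(1, p + 1))

# Improvement of TDT' over S under the likelihood loss, p = 2
gains = {n: round(likelihood_risk_gain(n, 2), 3) for n in (2, 3, 4)}
```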

An obvious disadvantage of these estimators is that they depend on the
coordinate system. Let $P_i$ be the $i$th permutation matrix, $i = 1, \ldots, p!$, and let
$P_i A P_i' = T_i T_i'$, where $T_i$ is lower triangular and $t_{jj} > 0$, $j = 1, \ldots, p$. Then a
randomized estimator that does not depend on the numbering of coordinates
is to let the estimator be $P_i' T_i D T_i' P_i$ with probability $1/p!$; this estimator has
the same risk as the estimator for the original numbering of coordinates.
Since the loss functions are convex, $(1/p!) \sum_i P_i' T_i D T_i' P_i$ will have at least as
good a risk function; in this case the risk will depend on $\Sigma$.
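The averaged estimator $(1/p!)\sum_i P_i' T_i D T_i' P_i$ can be sketched directly by enumerating permutations, which is only practical for small $p$. The helper names `tdt` and `permutation_averaged`, the test matrix, and the weights `d` are assumptions for illustration.

```python
import itertools
import numpy as np

def tdt(A, d):
    """T D T' for the Cholesky factorization A = T T'."""
    T = np.linalg.cholesky(A)
    return T @ np.diag(d) @ T.T

def permutation_averaged(A, d):
    """(1/p!) * sum over permutation matrices P of P' T_P D T_P' P."""
    p = A.shape[0]
    total = np.zeros((p, p))
    count = 0
    for perm in itertools.permutations(range(p)):
        P = np.eye(p)[list(perm)]                 # permutation matrix
        total += P.T @ tdt(P @ A @ P.T, d) @ P
        count += 1
    return total / count

A = np.array([[4.0, 1.0, 0.5], [1.0, 3.0, 0.2], [0.5, 0.2, 2.0]])
d = np.array([0.1, 0.2, 0.3])
Q = np.eye(3)[[2, 0, 1]]                          # a relabeling of coordinates
G = permutation_averaged(A, d)
G_relabelled = permutation_averaged(Q @ A @ Q.T, d)
```

Relabeling the coordinates before averaging gives the same estimator as relabeling afterwards, which is the sense in which the average does not depend on the numbering.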

Haff (1980) has shown that $G(A) = [1/(n + p + 1)](A + \gamma u C)$,
where $\gamma$ is constant, $0 < \gamma < 2(p - 1)/(n - p + 3)$, $u = 1/\operatorname{tr}(A^{-1}C)$ and $C$ is
an arbitrary positive definite matrix, has a smaller quadratic risk than
$[1/(n + p + 1)]A$. The estimator $G(A) = (1/n)[A + u\,t(u)\,C]$, where $t(u)$ is an
absolutely continuous, nonincreasing function, $0 < t(u) < 2(p - 1)/n$, has a
smaller likelihood risk than $S$.


7.9. ELLIPTICALLY CONTOURED DISTRIBUTIONS 


7.9.1. Observations Elliptically Contoured 


Consider $x_1, \ldots, x_N$ observations on a random vector $X$ with density

(1)   $|\Lambda|^{-\frac12}\, g[(x - \nu)'\Lambda^{-1}(x - \nu)]$.

Let $A = \sum_{\alpha=1}^N (x_\alpha - \bar{x})(x_\alpha - \bar{x})'$, $n = N - 1$, $S = (1/n)A$. Then $S \to_p \Sigma$ as
$N \to \infty$. The limiting normal distribution of $\sqrt{N}\operatorname{vec}(S - \Sigma)$ was given in
Theorem 3.6.2.

The lower triangular matrix $T$, satisfying $A = TT'$, was used in Section 7.2
in deriving the distribution of $A$ and hence of $S$. Define the lower triangular
matrix $\bar{T}$ by $S = \bar{T}\bar{T}'$, $\bar{t}_{ii} > 0$, $i = 1, \ldots, p$. Then $\bar{T} = (1/\sqrt{n})T$. If $\Sigma = I$, then
$S \to_p I$ and $\bar{T} \to_p I$, $\sqrt{N}(S - I)$ and $\sqrt{N}(\bar{T} - I)$ have limiting normal
distributions, and

(2)   $\sqrt{N}(S - I) = \sqrt{N}(\bar{T} - I) + \sqrt{N}(\bar{T} - I)' + o_p(1)$.

That is, $\sqrt{N}(s_{ii} - 1) = 2\sqrt{N}(\bar{t}_{ii} - 1) + o_p(1)$, and $\sqrt{N}s_{ij} = \sqrt{N}\bar{t}_{ij} + o_p(1)$, $i > j$.
When $\Sigma = I$, the set $\sqrt{N}(s_{11} - 1), \ldots, \sqrt{N}(s_{pp} - 1)$ and the set $\sqrt{N}s_{ij}$, $i > j$,
are asymptotically independent; $\sqrt{N}s_{12}, \ldots, \sqrt{N}s_{p-1,p}$ are mutually
asymptotically independent, each with variance $1 + \kappa$; the limiting variance of
$\sqrt{N}(s_{ii} - 1)$ is $3\kappa + 2$; and the limiting covariance of $\sqrt{N}(s_{ii} - 1)$ and
$\sqrt{N}(s_{jj} - 1)$, $i \ne j$, is $\kappa$.
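The expansion (2) rests on the exact identity $S - I = (\bar{T} - I) + (\bar{T} - I)' + (\bar{T} - I)(\bar{T} - I)'$, whose last term is of second order. The sketch below illustrates this with a deterministic perturbation $S = I + \epsilon E$ in place of a random sample; the symmetric matrix `E` and the function name `cholesky_remainder` are illustrative assumptions.

```python
import numpy as np

def cholesky_remainder(E, eps):
    """Largest entry of (S - I) - [(T - I) + (T - I)'] for S = I + eps*E = T T'."""
    p = E.shape[0]
    S = np.eye(p) + eps * E
    T = np.linalg.cholesky(S)
    U = T - np.eye(p)
    return np.abs((S - np.eye(p)) - (U + U.T)).max()

E = np.array([[1.0, 0.5, -0.3],
              [0.5, -2.0, 0.8],
              [-0.3, 0.8, 0.4]])       # arbitrary symmetric direction
r1 = cholesky_remainder(E, 1e-2)
r2 = cholesky_remainder(E, 1e-3)       # about 100 times smaller than r1
```

Shrinking the perturbation by a factor of 10 shrinks the remainder by roughly a factor of 100, consistent with a remainder that vanishes after multiplication by $\sqrt{N}$ when $S - I$ is of order $N^{-1/2}$.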




Theorem 7.9.1. If $\Sigma = I_p$, the limiting distribution of $\sqrt{N}(\bar{T} - I_p)$ is normal
with mean 0. The variance of a diagonal element is $(3\kappa + 2)/4$; the covariance of
two diagonal elements is $\kappa/4$; the variance of an off-diagonal element is $\kappa + 1$;
the off-diagonal elements are uncorrelated and are uncorrelated with the diagonal
elements.


Let $X = \nu + CY$, where $Y$ has the density $g(y'y)$, $\Lambda = CC'$, and
$\Sigma = \mathcal{E}(X - \nu)(X - \nu)' = (\mathcal{E}R^2/p)\Lambda = \Gamma\Gamma'$, and $C$ and $\Gamma$ are lower triangular. Let $S$
be the sample covariance of a sample of $N$ on $X$. Let $S = \bar{T}\bar{T}'$. Then $S \to_p \Sigma$,
$\bar{T} \to_p \Gamma$, and

(3)   $\sqrt{N}(S - \Sigma) = \sqrt{N}(\bar{T} - \Gamma)\Gamma' + \Gamma\sqrt{N}(\bar{T} - \Gamma)' + o_p(1)$.

The limiting distribution of $\sqrt{N}(\bar{T} - \Gamma)$ is normal, and the covariance can be
calculated from (3) and the covariances of the elements of $\sqrt{N}(S - \Sigma)$. Since
the primary interest in $\bar{T}$ is to find the distribution of $S$, we do not pursue
this further here.


7.9.2. Elliptically Contoured Matrix Distributions

Let $X$ ($N \times p$) have the density

(4)   $|C|^{-N}\, g[C^{-1}(X - \varepsilon_N \nu')'(X - \varepsilon_N \nu')(C')^{-1}]$

based on the left spherical density $g(Y'Y)$.


Theorem 7.9.2. Define $T = (t_{ij})$ by $Y'Y = TT'$, $t_{ij} = 0$, $i < j$, and $t_{ii} > 0$. If
the density of $Y$ is $g(Y'Y)$, then the density of $T$ is

(5)   $\prod_{i=1}^p \left\{ \dfrac{2\pi^{\frac12(N+1-i)}}{\Gamma[\frac12(N+1-i)]}\, t_{ii}^{N-i} \right\} g(TT')$.




Proof. Let $Y = (v_1, \ldots, v_p)$. Define $w_i$ and $u_i$ recursively by $w_1 = v_1$, $u_1 = w_1/\|w_1\|$,

(6)   $w_i = v_i - \sum_{j=1}^{i-1} \dfrac{w_j'v_i}{w_j'w_j}\, w_j = v_i - \sum_{j=1}^{i-1} u_j u_j' v_i$,

and $u_i = w_i/\|w_i\|$. Then $w_i'w_j = 0$, $u_i'u_j = 0$, $i \ne j$, and $u_i'u_i = 1$. Conditional on
$v_1, \ldots, v_{i-1}$ (that is, $w_1, \ldots, w_{i-1}$), let $Q_i$ be an orthogonal matrix with
$u_1', \ldots, u_{i-1}'$ as the first $i - 1$ rows; that is,

(7)   $Q_i = (u_1, \ldots, u_{i-1}, Q_i^{*\prime})'$.

(See Lemma A.4.2.) Define

(8)   $z_i = Q_i v_i = (t_{i1}, \ldots, t_{i,i-1}, z_i^{*\prime})'$.

This transformation of $v_i$ is linear and has Jacobian 1. The vector $z_i^*$ has
$N + 1 - i$ components. Note that $\|z_i^*\|^2 = \|w_i\|^2$,

(9)   $v_i = \sum_{j=1}^{i-1} t_{ij} u_j + w_i = \sum_{j=1}^{i-1} t_{ij} u_j + Q_i^{*\prime} z_i^*$,

(10)   $v_i'v_i = \sum_{j=1}^{i-1} t_{ij}^2 + z_i^{*\prime} z_i^* = \sum_{j=1}^{i} t_{ij}^2$,

(11)   $v_j'v_i = \sum_{k=1}^{j} t_{jk} t_{ik}$,   $j < i$.

The transformation from $Y = (v_1, \ldots, v_p)$ to $z_1, \ldots, z_p$ has Jacobian 1.
To obtain the density of $T$ convert $z_i^*$ to polar coordinates and integrate
with respect to the angular coordinates. (See Section 2.7.1.) ∎
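The recursion (6) is the Gram-Schmidt process, and the coefficients $t_{ij} = u_j'v_i$ together with $t_{ii} = \|w_i\|$ form the lower triangular factor of $Y'Y$. A minimal sketch follows, assuming a random $6 \times 3$ matrix $Y$ for illustration; the function name `gram_schmidt_T` is not from the text.

```python
import numpy as np

def gram_schmidt_T(Y):
    """Lower triangular T with t_ij = u_j'v_i (j < i) and t_ii = ||w_i||."""
    N, p = Y.shape
    T = np.zeros((p, p))
    U = np.zeros((N, p))                   # columns u_1, ..., u_p
    for i in range(p):
        v = Y[:, i]
        coeffs = U[:, :i].T @ v            # t_i1, ..., t_{i,i-1}
        w = v - U[:, :i] @ coeffs          # equation (6)
        T[i, :i] = coeffs
        T[i, i] = np.linalg.norm(w)
        U[:, i] = w / T[i, i]
    return T

rng = np.random.default_rng(1)
Y = rng.standard_normal((6, 3))            # illustrative N = 6, p = 3
T = gram_schmidt_T(Y)
```

Because the Cholesky factor with positive diagonal is unique, `T` should coincide with `np.linalg.cholesky(Y.T @ Y)` up to rounding, which matches (10) and (11).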


The above proof follows the lines of the proof of (6) in Section 7.2, but
does not use information about the normal distribution, such as $t_{ii}^2 \overset{d}{=} \chi^2_{N+1-i}$.

See also Fang and Zhang (1990), Theorem 3.4.1.
Let $C$ be a lower triangular matrix such that $\Lambda = CC'$. Define $X = YC'$.


Theorem 7.9.3. If $X$ ($N \times p$) has the density

(12)   $|C|^{-N}\, g[C^{-1}X'X(C')^{-1}]$,

then the lower triangular matrix $T^*$ satisfying $X'X = T^*T^{*\prime}$ and $t_{ii}^* > 0$ has the
density

(13)   $\dfrac{2^p \pi^{\frac12 Np}}{\Gamma_p(\frac12 N)\,|\Lambda|^{\frac12 N}} \prod_{i=1}^p (t_{ii}^*)^{N-i}\, g[C^{-1}T^*T^{*\prime}(C')^{-1}]$.

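Let $A = X'X = T^*T^{*\prime}$.

Theorem 7.9.4. If $X$ has the density (12), then $A = X'X$ has the density

(14)   $\dfrac{\pi^{\frac12 Np}}{\Gamma_p(\frac12 N)\,|\Lambda|^{\frac12 N}}\, |A|^{\frac12(N-p-1)}\, g[C^{-1}A(C')^{-1}]$.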


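The class of densities $g(\operatorname{tr} Y'Y)$ is a subclass of densities $g(Y'Y)$. Let
$X = \varepsilon_N \nu' + YC'$. Then the density of $X$ is

(15)   $|\Lambda|^{-\frac12 N}\, g[\operatorname{tr}\,(X - \varepsilon_N \nu')\Lambda^{-1}(X - \varepsilon_N \nu')']$.

A stochastic representation of $X$ is $\operatorname{vec} X \overset{d}{=} R(C \otimes I_N)\operatorname{vec} U + \nu \otimes \varepsilon_N$.
Theorems 7.9.3 and 7.9.4 can be specialized to this form. Then Theorem 3.6.5
holds.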


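Theorem 7.9.5. Let $X$ have the density (12) where $\Lambda$ is diagonal. Let
$S = (N - 1)^{-1}(X - \varepsilon_N \bar{x}')'(X - \varepsilon_N \bar{x}')$ and $R = (\operatorname{diag} S)^{-\frac12} S (\operatorname{diag} S)^{-\frac12}$. Then
the density of $R$ is (9) of Section 7.6.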


PROBLEMS 


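7.1. (Sec. 7.2) A transformation from rectangular to polar coordinates is

$y_1 = w \sin\theta_1$,
$y_2 = w \cos\theta_1 \sin\theta_2$,
$y_3 = w \cos\theta_1 \cos\theta_2 \sin\theta_3$,
$\vdots$
$y_{n-1} = w \cos\theta_1 \cos\theta_2 \cdots \cos\theta_{n-2} \sin\theta_{n-1}$,
$y_n = w \cos\theta_1 \cos\theta_2 \cdots \cos\theta_{n-2} \cos\theta_{n-1}$,

where $-\frac12\pi < \theta_i \le \frac12\pi$, $i = 1, \ldots, n - 2$, $-\pi < \theta_{n-1} \le \pi$, and $0 \le w < \infty$.

(a) Prove $w^2 = \sum y_i^2$. [Hint: Compute in turn $y_n^2 + y_{n-1}^2$, $(y_n^2 + y_{n-1}^2) + y_{n-2}^2$, and
so forth.]

(b) Show that the Jacobian is $w^{n-1} \cos^{n-2}\theta_1 \cos^{n-3}\theta_2 \cdots \cos\theta_{n-2}$. [Hint: Prove

$\left| \dfrac{\partial(y_1, \ldots, y_n)}{\partial(\theta_1, \ldots, \theta_{n-1}, w)} \right| \cdot \begin{vmatrix} \cos\theta_1 & 0 & \cdots & 0 & 0 \\ 0 & \cos\theta_2 & \cdots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & \cos\theta_{n-1} & 0 \\ w\sin\theta_1 & w\sin\theta_2 & \cdots & w\sin\theta_{n-1} & 1 \end{vmatrix} = \begin{vmatrix} w & x & \cdots & x & x \\ 0 & w\cos\theta_1 & \cdots & x & x \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & w\cos\theta_1 \cdots \cos\theta_{n-2} & x \\ 0 & 0 & \cdots & 0 & \cos\theta_1 \cdots \cos\theta_{n-1} \end{vmatrix}$,

where $x$ denotes elements whose explicit values are not needed.]

7.2. (Sec. 7.2) Prove that

$\displaystyle\int_{-\pi/2}^{\pi/2} \cos^{h-1}\theta \, d\theta = \dfrac{\Gamma(\frac12 h)\,\Gamma(\frac12)}{\Gamma[\frac12(h+1)]}$.

[Hint: Let $\cos^2\theta = u$, and use the definition of $B(p, q)$.]

7.3. (Sec. 7.2) Use Problems 7.1 and 7.2 to prove that the surface area of a sphere of
unit radius in $n$ dimensions is

$C(n) = \dfrac{2\pi^{\frac12 n}}{\Gamma(\frac12 n)}$.

7.4. (Sec. 7.2) Use Problems 7.1, 7.2, and 7.3 to prove that if the density of
$y' = (y_1, \ldots, y_n)$ is $f(y'y)$, then the density of $u = y'y$ is $\frac12 C(n) f(u) u^{\frac12 n - 1}$.

7.5. (Sec. 7.2) $\chi^2$-distribution. Use Problem 7.4 to show that if $y_1, \ldots, y_n$ are
independently distributed, each according to $N(0, 1)$, then $U = \sum_{\alpha=1}^n y_\alpha^2$ has the
density $u^{\frac12 n - 1} e^{-\frac12 u}/[2^{\frac12 n}\Gamma(\frac12 n)]$, which is the $\chi^2$-density with $n$ degrees of
freedom.

7.6. (Sec. 7.2) Use (9) of Section 7.6 to derive the distribution of $A$.

7.7. (Sec. 7.2) Use the proof of Theorem 7.2.1 to demonstrate $\Pr\{|A| = 0\} = 0$.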


7.8. (Sec. 7.2) Independence of estimators of the parameters of the complex normal
distribution. Let $z_1, \ldots, z_N$ be $N$ observations from the complex normal
distribution with mean $\theta$ and covariance matrix $P$. (See Problem 2.64.) Show that $\bar{z}$
and $A = \sum_{\alpha=1}^N (z_\alpha - \bar{z})(z_\alpha - \bar{z})^*$ are independently distributed, and show that
$A$ has the distribution of $\sum_{\alpha=1}^n W_\alpha W_\alpha^*$, where $W_1, \ldots, W_n$ are independently
distributed, each according to the complex normal distribution with mean 0 and
covariance matrix $P$.

7.9. (Sec. 7.2) The complex Wishart distribution. Let $W_1, \ldots, W_n$ be independently
distributed, each according to the complex normal distribution with mean 0 and
covariance matrix $P$. (See Problem 2.64.) Show that the density of
$B = \sum_{\alpha=1}^n W_\alpha W_\alpha^*$ is

$\dfrac{|B|^{n-p}\, e^{-\operatorname{tr} P^{-1}B}}{|P|^n\, \pi^{\frac12 p(p-1)} \prod_{i=1}^p \Gamma(n + 1 - i)}$.

7.10. (Sec. 7.3) Find the characteristic function of $A$ from $W(\Sigma, n)$. [Hint: From
$\int w(A \mid \Sigma, n)\, dA = 1$, one derives

$\displaystyle\int |A|^{\frac12(n-p-1)} \exp(-\tfrac12 \operatorname{tr} \Phi^{-1}A)\, dA = 2^{\frac12 np}\, \Gamma_p(\tfrac12 n)\, |\Phi|^{\frac12 n}$

as an identity in $\Phi$.] Note that comparison of this result with that of Section
7.3.1 is a proof of the Wishart distribution.

7.11. (Sec. 7.3.2) Prove Theorem 7.3.2 by use of characteristic functions.

7.12. (Sec. 7.3.1) Find the first two moments of the elements of $A$ by differentiating
the characteristic function (11).

7.13. (Sec. 7.3) Let $Z_1, \ldots, Z_n$ be independently distributed, each according to
$N(0, I)$. Let $W = \sum_{\alpha,\beta=1}^n b_{\alpha\beta} Z_\alpha Z_\beta'$. Prove that if $a'Wa \overset{d}{=} \chi^2_m$ for all $a$ such that
$a'a = 1$, then $W$ is distributed according to $W(I, m)$. [Hint: Use the characteristic
function of $a'Wa$.]

7.14. (Sec. 7.4) Let $x_\alpha$ be an observation from $N(\beta z_\alpha, \Sigma)$, $\alpha = 1, \ldots, N$, where $z_\alpha$ is
a scalar. Let $b = \sum_\alpha z_\alpha x_\alpha / \sum_\alpha z_\alpha^2$. Use Theorem 7.4.1 to show that
$\sum_\alpha x_\alpha x_\alpha' - bb' \sum_\alpha z_\alpha^2$ and $bb'$ are independent.

7.15. (Sec. 7.4) Show that

$\mathcal{E}(\chi^2_{N-1}\chi^2_{N-2})^h = \mathcal{E}(\chi^4_{2N-4}/4)^h$,   $h > 0$,

by use of the duplication formula for the gamma function; $\chi^2_{N-1}$ and $\chi^2_{N-2}$ are
independent. Hence show that the distribution of $\chi^2_{N-1}\chi^2_{N-2}$ is the distribution
of $\chi^4_{2N-4}/4$.


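7.16. (Sec. 7.4) Verify that Theorem 7.4.1 follows from Lemma 7.4.1. [Hint: Prove
that $Q_i$ having the distribution $W(\Sigma, r_i)$ implies the existence of (6) where $I$ is
of order $r_i$ and that the independence of the $Q_i$'s implies that the $I$'s in (6) do
not overlap.]

7.17. (Sec. 7.5) Find $\mathcal{E}|A|^h$ directly from $W(\Sigma, n)$. [Hint: The fact that

$\displaystyle\int w(A \mid \Sigma, n)\, dA \equiv 1$

shows

$\displaystyle\int |A|^{\frac12(n-p-1)} \exp(-\tfrac12 \operatorname{tr} \Sigma^{-1}A)\, dA \equiv 2^{\frac12 pn}\, \Gamma_p(\tfrac12 n)\, |\Sigma|^{\frac12 n}$

as an identity in $n$.]

7.18. (Sec. 7.5) Consider the confidence region for $\mu$ given by

$N(\bar{x} - \mu^*)'S^{-1}(\bar{x} - \mu^*) \le \dfrac{(N-1)p}{N-p}\, F_{p,N-p}(\varepsilon)$,

where $\bar{x}$ and $S$ are based on a sample of $N$ from $N(\mu, \Sigma)$. Find the expected
value of the volume of the confidence region.

7.19. (Sec. 7.6) Prove that if $\Sigma = I$, the joint density of $r_{ij \cdot p}$, $i, j = 1, \ldots, p - 1$, and
$r_{1p}, \ldots, r_{p-1,p}$ is

$\dfrac{\Gamma^{p-1}[\frac12(n-1)]\, |R_{11 \cdot p}|^{\frac12(n-p-1)}}{\pi^{\frac14(p-1)(p-2)} \prod_{i=1}^{p-1} \Gamma[\frac12(n-i)]} \prod_{i=1}^{p-1} \dfrac{\Gamma(\frac12 n)}{\pi^{\frac12}\,\Gamma[\frac12(n-1)]}\,(1 - r_{ip}^2)^{\frac12(n-3)}$,

where $R_{11 \cdot p} = (r_{ij \cdot p})$. [Hint: $r_{ij \cdot p} = (r_{ij} - r_{ip}r_{jp})/(\sqrt{1 - r_{ip}^2}\sqrt{1 - r_{jp}^2})$ and
$|r_{ij}| = |r_{ij \cdot p}| \prod_{i=1}^{p-1}(1 - r_{ip}^2)$. Use (9).]

7.20. (Sec. 7.6) Prove that the joint density of $r_{12 \cdot 3,\ldots,p}$, $r_{13 \cdot 4,\ldots,p}$, $r_{23 \cdot 4,\ldots,p}$, $\ldots$,
$r_{1p}, \ldots, r_{p-1,p}$ is

$\dfrac{\Gamma\{\frac12[n-(p-2)]\}}{\pi^{\frac12}\,\Gamma\{\frac12[n-(p-1)]\}}\,(1 - r_{12 \cdot 3,\ldots,p}^2)^{\frac12(n-p-1)} \cdots \prod_{i=1}^{p-2} \dfrac{\Gamma[\frac12(n-1)]}{\pi^{\frac12}\,\Gamma[\frac12(n-2)]}\,(1 - r_{i,p-1 \cdot p}^2)^{\frac12(n-4)} \cdot \prod_{i=1}^{p-1} \dfrac{\Gamma(\frac12 n)}{\pi^{\frac12}\,\Gamma[\frac12(n-1)]}\,(1 - r_{ip}^2)^{\frac12(n-3)}$.

[Hint: Use the result of Problem 7.19 inductively.]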


7.21. (Sec. 7.6) Prove (without the use of Problem 7.20) that if $\Sigma = I$, then
$r_{1p}, \ldots, r_{p-1,p}$ are independently distributed. [Hint: $r_{ip} = a_{ip}/(\sqrt{a_{ii}}\sqrt{a_{pp}})$. Prove
that the pairs $(a_{1p}, a_{11}), \ldots, (a_{p-1,p}, a_{p-1,p-1})$ are independent when
$(z_{1p}, \ldots, z_{np})$ are fixed, and note from Section 4.2.1 that the marginal
distribution of $r_{ip}$, conditional on $z_{\alpha p}$, does not depend on $z_{\alpha p}$.]

7.22. (Sec. 7.6) Prove (without the use of Problems 7.19 and 7.20) that if $\Sigma = I$, then
the set $r_{1p}, \ldots, r_{p-1,p}$ is independent of the set $r_{ij \cdot p}$, $i, j = 1, \ldots, p - 1$. [Hint:
From Section 4.3.2 $a_{pp}$ and $(a_{ip})$ are independent of $(a_{ij \cdot p})$. Prove that
$a_{pp}$, $(a_{ip})$, and $a_{ii}$, $i = 1, \ldots, p - 1$, are independent of $(r_{ij \cdot p})$ by proving that
$a_{ii \cdot p}$ are independent of $(r_{ij \cdot p})$. See Problem 4.21.]

7.23. (Sec. 7.6) Prove the conclusion of Problem 7.20 by using Problems 7.21 and
7.22.

7.24. (Sec. 7.6) Reverse the steps in Problem 7.20 to derive (9) of Section 7.6.

7.25. (Sec. 7.6) Show that when $p = 3$ and $\Sigma$ is diagonal $r_{12}, r_{13}, r_{23}$ are not
mutually independent.

7.26. (Sec. 7.6) Show that when $\Sigma$ is diagonal the set $r_{ij}$ are pairwise independent.

7.27. (Sec. 7.7) Multivariate $t$-distribution. Let $y$ and $u$ be independently distributed
according to $N(0, \Sigma)$ and the $\chi^2_n$-distribution, respectively, and let $\sqrt{n/u}\, y = x - \mu$.

(a) Show that the density of $x$ is

$\dfrac{\Gamma[\frac12(n+p)]}{\Gamma(\frac12 n)\, n^{\frac12 p} \pi^{\frac12 p} |\Sigma|^{\frac12} \left[1 + \frac1n (x - \mu)'\Sigma^{-1}(x - \mu)\right]^{\frac12(n+p)}}$.

(b) Show that $\mathcal{E}x = \mu$ and

$\mathcal{E}(x - \mu)(x - \mu)' = \dfrac{n}{n-2}\,\Sigma$.

7.28. (Sec. 7.8) Prove that $F\varepsilon$ is not proportional to $f$ by calculating $F\varepsilon$.

7.29. (Sec. 7.8) Prove for $p = 2$

$TDT' = d_1 A + (d_2 - d_1) \begin{pmatrix} 0 & 0 \\ 0 & |A|/a_{11} \end{pmatrix}$.

7.30. (Sec. 7.8) Verify (17) and (18). [Hint: To verify (18) let $\Sigma = KK'$, $A = KA^*K'$,
and $A^* = T^*T^{*\prime}$, where $K$ and $T^*$ are lower triangular.]


7.31. (Sec. 7.8) Prove for optimal $D$

$\mathcal{E}_I L_\ell(I, S) - \mathcal{E}_I L_\ell(I, TDT') = -\sum_{j=1}^{\frac12 p} \log\left[1 - \left(\dfrac{p - 2j + 1}{n}\right)^2\right]$,   $p$ even,

$= -\sum_{j=1}^{\frac12(p-1)} \log\left[1 - \left(\dfrac{p - 2j + 1}{n}\right)^2\right]$,   $p$ odd.

7.32. (Sec. 7.8) Prove $L_q(\Sigma, G)$ and $L_\ell(\Sigma, G)$ are invariant with respect to
transformations $G^* = CGC'$, $\Sigma^* = C\Sigma C'$ for $C$ nonsingular.

7.33. (Sec. 7.8) Prove $L_q(\Sigma, G)$ is a multiple of $(g - \sigma)'\Phi^{-1}(g - \sigma)$. Hint: Transform
so $\Sigma = I$. Then show

$\Phi = \dfrac{1}{n} \begin{pmatrix} 2I & 0 \\ 0 & I \end{pmatrix}$.

7.34. (Sec. 7.8) Verify (11).

7.35. Let the density of $Y$ be $f(y) = K$ for $y'y \le p + 2$ and 0 elsewhere. Prove that
$K = \Gamma(\frac12 p + 1)/[(p + 2)\pi]^{\frac12 p}$, and show that $\mathcal{E}Y = 0$ and $\mathcal{E}YY' = I$.

7.36. (Sec. 7.2) Dirichlet distribution. Let $Y_1, \ldots, Y_m$ be independently distributed as
$\chi^2$-variables with $p_1, \ldots, p_m$ degrees of freedom, respectively. Define
$Z_i = Y_i / \sum_{j=1}^m Y_j$, $i = 1, \ldots, m$. Show that the density of $Z_1, \ldots, Z_{m-1}$ is

$\dfrac{\Gamma(\frac12 \sum_{i=1}^m p_i)}{\prod_{i=1}^m \Gamma(\frac12 p_i)} \prod_{i=1}^m z_i^{\frac12 p_i - 1}$,   $z_m = 1 - \sum_{i=1}^{m-1} z_i$,

for $z_i \ge 0$, $i = 1, \ldots, m$.

7.37. (Sec. 7.5) Show that if $\chi^2_{N-1}$ and $\chi^2_{N-2}$ are independently distributed, then
$\chi^2_{N-1}\chi^2_{N-2}$ is distributed as $(\chi^2_{2N-4})^2/4$. [Hint: In the joint density of $x = \chi^2_{N-1}$
and $y = \chi^2_{N-2}$ substitute $z = 2\sqrt{xy}$, $x = x$, and express the marginal density of $z$
as $z^{N-3}h(z)$, where $h(z)$ is an integral with respect to $x$. Find $h'(z)$, and solve
the differential equation. See Srivastava and Khatri (1979), Chapter 3.]


CHAPTER 8 


Testing the General Linear 
Hypothesis; Multivariate 
Analysis of Variance 


8.1. INTRODUCTION 


In this chapter we generalize the univariate least squares theory (i.e., regres- 
sion analysis) and the analysis of variance to vector variates. The algebra of 
the multivariate case is essentially the same as that of the univariate case. 
This leads to distribution theory that is analogous to that of the univariate 
case and to test criteria that are analogs of F-statistics. In fact, given a 
univariate test, we shall be able to write down immediately a corresponding 
multivariate test. Since the analysis of variance based on the model of fixed 
effects can be obtained from least squares theory, we obtain directly a theory 
of multivariate analysis of variance. However, in the multivariate case there is 
more latitude in the choice of tests of significance. 

In univariate least squares we consider scalar dependent variates $x_1, \ldots, x_N$
drawn from populations with expected values $\beta'z_1, \ldots, \beta'z_N$, respectively,
where $\beta$ is a column vector of $q$ components and each of the $z_\alpha$ is a column
vector of $q$ known components. Under the assumption that the variances in
the populations are the same, the least squares estimator of $\beta'$ is

(1)   $b' = \left( \sum_{\alpha=1}^N x_\alpha z_\alpha' \right) \left( \sum_{\alpha=1}^N z_\alpha z_\alpha' \right)^{-1}$.


An Introduction to Multivariate Statistical Analysis, Third Edition. By T, W. Anderson 
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc, 




If the populations are normal, the vector $b$ is the maximum likelihood
estimator of $\beta$. The unbiased estimator of the common variance $\sigma^2$ is

(2)   $s^2 = \sum_{\alpha=1}^N (x_\alpha - b'z_\alpha)^2 / (N - q)$,

and under the assumption of normality, the maximum likelihood estimator of
$\sigma^2$ is $\hat\sigma^2 = (N - q)s^2/N$.

In the multivariate case $x_\alpha$ is a vector, $\beta'$ is replaced by a matrix $\mathbf{B}$, and
$\sigma^2$ is replaced by a covariance matrix $\Sigma$. The estimators of $\mathbf{B}$ and $\Sigma$, given
in Section 8.2, are matric analogs of (1) and (2).

To test a hypothesis concerning $\beta$, say the hypothesis $\beta = 0$, we use an
$F$-test. A criterion equivalent to the $F$-ratio is

(3)   $\dfrac{1}{[q/(N - q)]F + 1} = \dfrac{\hat\sigma^2}{\hat\sigma_0^2}$,

where $\hat\sigma_0^2$ is the maximum likelihood estimator of $\sigma^2$ under the null
hypothesis. We shall find that the likelihood ratio criterion for the
corresponding multivariate hypothesis, say $\mathbf{B} = 0$, is the above with the variances
replaced by generalized variances. The distribution of the likelihood ratio
criterion under the null hypothesis is characterized, the moments are found,
and some specific distributions obtained. Satisfactory approximations are
given as well as tables of significance points (Appendix B).
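The identity (3) is purely algebraic and can be illustrated on simulated data. In this sketch the design matrix `Z` (rows $z_\alpha'$), the coefficient vector, and the sample size are assumptions chosen for the example; the $F$-ratio is the usual one for testing $\beta = 0$ with $q$ and $N - q$ degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(2)
N, q = 12, 3
Z = rng.standard_normal((N, q))               # rows are z_alpha'
x = Z @ np.array([0.5, -1.0, 0.0]) + rng.standard_normal(N)

b = np.linalg.solve(Z.T @ Z, Z.T @ x)         # least squares estimator, as in (1)
rss = np.sum((x - Z @ b) ** 2)                # N * sigma_hat^2
rss0 = np.sum(x ** 2)                         # N * sigma_hat_0^2 under beta = 0
F = ((rss0 - rss) / q) / (rss / (N - q))

lhs = 1.0 / (q / (N - q) * F + 1.0)           # left-hand side of (3)
rhs = rss / rss0                              # ratio of the two ML variance estimates
```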

The hypothesis testing problem is invariant under several groups of linear 
transformations. Other invariant criteria are treated, including the 
Lawley—Hotelling trace, the Bartlett-Nanda-Pillai trace, and the Roy maxi- 
mum root criteria. Some comparison of power is made. 

Confidence regions or simultaneous confidence intervals for elements of B 
can be based on the likelihood ratio test, the Lawley—Hotelling trace test, 
and the Roy maximum root test. Procedures are given explicitly for several 
problems of the analysis of variance. Optimal properties of admissibility, 
unbiasedness, and monotonicity of power functions are studied. Finally, the 
theory and methods are extended to elliptically contoured distributions. 


8.2. ESTIMATORS OF PARAMETERS IN MULTIVARIATE 
LINEAR REGRESSION 


8.2.1. Maximum Likelihood Estimators; Least Squares Estimators 


Suppose $x_1, \ldots, x_N$ are a set of $N$ independent observations, $x_\alpha$ being drawn
from $N(\mathbf{B}z_\alpha, \Sigma)$. Ordinarily the vectors $z_\alpha$ (with $q$ components) are known
vectors, and the $p \times p$ matrix $\Sigma$ and the $p \times q$ matrix $\mathbf{B}$ are unknown. We
assume $N > p + q$ and the rank of

(1)   $Z = (z_1, \ldots, z_N)$

is $q$. We shall estimate $\Sigma$ and $\mathbf{B}$ by the method of maximum likelihood. The
likelihood function is

(2)   $L = (2\pi)^{-\frac12 pN} |\Sigma^*|^{-\frac12 N} \exp\left[ -\tfrac12 \sum_{\alpha=1}^N (x_\alpha - \mathbf{B}^*z_\alpha)'\Sigma^{*-1}(x_\alpha - \mathbf{B}^*z_\alpha) \right]$.

In (2) the elements of $\Sigma^*$ and $\mathbf{B}^*$ are indeterminates. The method of
maximum likelihood specifies the estimators of $\Sigma$ and $\mathbf{B}$ based on the given
sample $x_1, z_1, \ldots, x_N, z_N$ as the $\Sigma^*$ and $\mathbf{B}^*$ that maximize (2). It is
convenient to use the following lemma.


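Lemma 8.2.1. Let

(3)   $B = \sum_{\alpha=1}^N x_\alpha z_\alpha' \left( \sum_{\alpha=1}^N z_\alpha z_\alpha' \right)^{-1}$.

Then for any $p \times q$ matrix $F$

(4)   $\sum_{\alpha=1}^N (x_\alpha - Fz_\alpha)(x_\alpha - Fz_\alpha)' = \sum_{\alpha=1}^N (x_\alpha - Bz_\alpha)(x_\alpha - Bz_\alpha)' + (B - F) \sum_{\alpha=1}^N z_\alpha z_\alpha' (B - F)'$.

Proof. The left-hand side of (4) is

(5)   $\sum_{\alpha=1}^N [(x_\alpha - Bz_\alpha) + (B - F)z_\alpha][(x_\alpha - Bz_\alpha) + (B - F)z_\alpha]'$,

which is equal to the right-hand side of (4) because

(6)   $\sum_{\alpha=1}^N z_\alpha (x_\alpha - Bz_\alpha)' = 0$

by virtue of (3). ∎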




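The exponential in $L$ is $-\frac12$ times

(7)   $\operatorname{tr} \Sigma^{*-1} \sum_{\alpha=1}^N (x_\alpha - \mathbf{B}^*z_\alpha)(x_\alpha - \mathbf{B}^*z_\alpha)' = \operatorname{tr} \Sigma^{*-1} \sum_{\alpha=1}^N (x_\alpha - Bz_\alpha)(x_\alpha - Bz_\alpha)' + \operatorname{tr} \Sigma^{*-1}(B - \mathbf{B}^*)A(B - \mathbf{B}^*)'$,

where

(8)   $A = \sum_{\alpha=1}^N z_\alpha z_\alpha'$.

The likelihood is maximized with respect to $\mathbf{B}^*$ by minimizing the last term
in (7).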


Lemma 8.2.2. If $A$ and $G$ are positive definite, $\operatorname{tr} FAF'G > 0$ for $F \ne 0$.

Proof. Let $A = HH'$, $G = KK'$. Then

(9)   $\operatorname{tr} FAF'G = \operatorname{tr} FHH'F'KK' = \operatorname{tr} K'FHH'F'K$
$= \operatorname{tr}\,(K'FH)(K'FH)' > 0$

for $F \ne 0$ because then $K'FH \ne 0$ since $H$ and $K$ are nonsingular. ∎


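It follows from (7) and the lemma that $L$ is maximized with respect to $\mathbf{B}^*$
by $\mathbf{B}^* = B$, that is,

(10)   $\hat{\mathbf{B}} = CA^{-1}$,

where

(11)   $C = \sum_{\alpha=1}^N x_\alpha z_\alpha'$.

Then by Lemma 3.2.2, $L$ is maximized with respect to $\Sigma^*$ at

(12)   $\hat\Sigma = \dfrac{1}{N} \sum_{\alpha=1}^N (x_\alpha - \hat{\mathbf{B}}z_\alpha)(x_\alpha - \hat{\mathbf{B}}z_\alpha)'$.

This is the multivariate analog of $\hat\sigma^2 = (N - q)s^2/N$ defined by (2) of
Section 8.1.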


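Theorem 8.2.1. If $x_\alpha$ is an observation from $N(\mathbf{B}z_\alpha, \Sigma)$, $\alpha = 1, \ldots, N$, with
$(z_1, \ldots, z_N)$ of rank $q$, the maximum likelihood estimator of $\mathbf{B}$ is given by (10),
where $C = \sum_\alpha x_\alpha z_\alpha'$ and $A = \sum_\alpha z_\alpha z_\alpha'$. The maximum likelihood estimator of $\Sigma$
is given by (12).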


A useful algebraic result follows from (12) and (4) with $F = 0$:

(13)   $N\hat\Sigma = \sum_{\alpha=1}^N x_\alpha x_\alpha' - \hat{\mathbf{B}}A\hat{\mathbf{B}}' = \sum_{\alpha=1}^N x_\alpha x_\alpha' - CA^{-1}C'$.


Now let us consider a geometric interpretation of the estimation procedure.
Let the $i$th row of $(x_1, \ldots, x_N)$ be $x_i^*$ (with $N$ components) and the $i$th
row of $(z_1, \ldots, z_N)$ be $z_i^*$ (with $N$ components). Then $\sum_j \hat\beta_{ij} z_j^*$, being a linear
combination of the vectors $z_1^*, \ldots, z_q^*$, is a vector in the $q$-space spanned by
$z_1^*, \ldots, z_q^*$, and is in fact, of all such vectors, the one nearest to $x_i^*$; hence, it
is the projection of $x_i^*$ on the $q$-space. Thus $x_i^* - \sum_j \hat\beta_{ij} z_j^*$ is the vector
orthogonal to the $q$-space going from the projection of $x_i^*$ on the $q$-space to
$x_i^*$. Translate this vector so that one endpoint is at the origin. Then the set of
$p$ vectors $x_1^* - \sum_j \hat\beta_{1j} z_j^*, \ldots, x_p^* - \sum_j \hat\beta_{pj} z_j^*$ is a set of vectors emanating from
the origin. $N\hat\sigma_{ii} = (x_i^* - \sum_j \hat\beta_{ij} z_j^*)(x_i^* - \sum_j \hat\beta_{ij} z_j^*)'$ is the square of the length
of the $i$th such vector, and $N\hat\sigma_{ij} = (x_i^* - \sum_h \hat\beta_{ih} z_h^*)(x_j^* - \sum_g \hat\beta_{jg} z_g^*)'$ is the
product of the length of the $i$th vector, the length of the $j$th vector, and the
cosine of the angle between them.

The equations defining the maximum likelihood estimator of $\mathbf{B}$, namely,
$A\hat{\mathbf{B}}' = C'$, consist of $p$ sets of $q$ linear equations in $q$ unknowns. Each set
can be solved by the method of pivotal condensation or successive elimination
(Section A.5 of the Appendix). The forward solutions are the same
(except the right-hand sides) for all sets. Use of (13) to compute $N\hat\Sigma$ involves
an efficient computation of $\hat{\mathbf{B}}A\hat{\mathbf{B}}'$.

Let $x_\alpha = (x_{1\alpha}, \ldots, x_{p\alpha})'$, $B = (b_1, \ldots, b_p)'$, and $\mathbf{B} = (\beta_1, \ldots, \beta_p)'$. Then
$\mathcal{E}X_{i\alpha} = \beta_i'z_\alpha$, and $b_i$ is the least squares estimator of $\beta_i$. If $G$ is a positive
definite matrix, then $\operatorname{tr} G \sum_{\alpha=1}^N (x_\alpha - Fz_\alpha)(x_\alpha - Fz_\alpha)'$ is minimized by $F = B$.
This is another sense in which $B$ is the least squares estimator.
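The estimators (10) and (12) and the identity (13) can be sketched as follows. This is an illustration under stated assumptions, not the pivotal condensation procedure mentioned in the text: the equations $A\hat{\mathbf{B}}' = C'$ are solved with a general linear solver, the data are simulated, and the function name `mle_regression` and the matrix `B_true` are hypothetical.

```python
import numpy as np

def mle_regression(X, Z):
    """X is p x N (columns x_alpha); Z is q x N (columns z_alpha)."""
    N = X.shape[1]
    A = Z @ Z.T                              # (8)
    C = X @ Z.T                              # (11)
    B_hat = np.linalg.solve(A, C.T).T        # solves A B' = C' without forming A^{-1}
    resid = X - B_hat @ Z
    Sigma_hat = resid @ resid.T / N          # (12)
    return B_hat, Sigma_hat, A, C

rng = np.random.default_rng(3)
p, q, N = 2, 3, 20
Z = rng.standard_normal((q, N))
B_true = np.array([[1.0, 0.0, -2.0], [0.5, 3.0, 0.0]])   # hypothetical parameter
X = B_true @ Z + rng.standard_normal((p, N))
B_hat, Sigma_hat, A, C = mle_regression(X, Z)
```

The residuals are orthogonal to the rows of `Z`, which is equation (6) of Lemma 8.2.1, and this orthogonality is what makes the shortcut (13) valid.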


8.2.2. Distribution of $\hat{\mathbf{B}}$ and $\hat\Sigma$


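Now let us find the joint distribution of $\hat\beta_{ig}$ ($i = 1, \ldots, p$, $g = 1, \ldots, q$). The
joint distribution is normal since the $\hat\beta_{ig}$ are linear combinations of the $X_{i\alpha}$.
From (10) we see that

(14)   $\mathcal{E}\hat{\mathbf{B}} = \mathcal{E} \sum_{\alpha=1}^N X_\alpha z_\alpha' A^{-1}$
$= \sum_{\alpha=1}^N \mathbf{B} z_\alpha z_\alpha' A^{-1} = \mathbf{B}AA^{-1}$
$= \mathbf{B}$.

Thus $\hat{\mathbf{B}}$ is an unbiased estimator of $\mathbf{B}$. The covariance between $\hat\beta_i'$ and $\hat\beta_j'$,
two rows of $\hat{\mathbf{B}}$, is

(15)   $\mathcal{E}(\hat\beta_i - \beta_i)(\hat\beta_j - \beta_j)' = A^{-1}\, \mathcal{E} \sum_{\alpha=1}^N (X_{i\alpha} - \mathcal{E}X_{i\alpha}) z_\alpha \sum_{\gamma=1}^N (X_{j\gamma} - \mathcal{E}X_{j\gamma}) z_\gamma' A^{-1}$
$= A^{-1} \sum_{\alpha,\gamma=1}^N \mathcal{E}(X_{i\alpha} - \mathcal{E}X_{i\alpha})(X_{j\gamma} - \mathcal{E}X_{j\gamma})\, z_\alpha z_\gamma' A^{-1}$
$= A^{-1} \sum_{\alpha,\gamma=1}^N \delta_{\alpha\gamma} \sigma_{ij}\, z_\alpha z_\gamma' A^{-1}$
$= A^{-1} \sum_{\alpha=1}^N \sigma_{ij}\, z_\alpha z_\alpha' A^{-1}$
$= \sigma_{ij} A^{-1}$.

To summarize, the vector of $pq$ components $(\hat\beta_1', \ldots, \hat\beta_p')' = \operatorname{vec} \hat{\mathbf{B}}'$ is
normally distributed with mean $(\beta_1', \ldots, \beta_p')' = \operatorname{vec} \mathbf{B}'$ and covariance matrix

(16)   $\begin{pmatrix} \sigma_{11}A^{-1} & \sigma_{12}A^{-1} & \cdots & \sigma_{1p}A^{-1} \\ \sigma_{21}A^{-1} & \sigma_{22}A^{-1} & \cdots & \sigma_{2p}A^{-1} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1}A^{-1} & \sigma_{p2}A^{-1} & \cdots & \sigma_{pp}A^{-1} \end{pmatrix}$.

The matrix (16) is the Kronecker (or direct) product of the matrices $\Sigma$ and
$A^{-1}$, denoted by $\Sigma \otimes A^{-1}$.

From Theorem 4.3.3 it follows that $N\hat\Sigma = \sum_{\alpha=1}^N x_\alpha x_\alpha' - \hat{\mathbf{B}}A\hat{\mathbf{B}}'$ is
distributed according to $W(\Sigma, N - q)$. From this we see that an unbiased
estimator of $\Sigma$ is $S = [N/(N - q)]\hat\Sigma$.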


Theorem 8.2.2. The maximum likelihood estimator $\hat{\mathbf{B}}$ based on a set of $N$
observations, the $\alpha$th from $N(\mathbf{B}z_\alpha, \Sigma)$, is normally distributed with mean $\mathbf{B}$, and
the covariance matrix of the $i$th and $j$th rows of $\hat{\mathbf{B}}$ is $\sigma_{ij}A^{-1}$, where $A = \sum_\alpha z_\alpha z_\alpha'$.
The maximum likelihood estimator $\hat\Sigma$ multiplied by $N$ is independently
distributed according to $W(\Sigma, N - q)$, where $q$ is the number of components of $z_\alpha$.
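The Kronecker structure (16) can be checked without simulation by propagating the covariance of the data through the linear map that produces $\hat{\mathbf{B}}$. The sketch below assumes the column-stacking convention for vec, so that $\operatorname{vec} X'$ has covariance $\Sigma \otimes I_N$; the particular `Sigma` and `Z` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
p, q, N = 2, 3, 7
Z = rng.standard_normal((q, N))               # columns z_alpha
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])    # illustrative covariance matrix
A_inv = np.linalg.inv(Z @ Z.T)

M = A_inv @ Z                                 # B_hat' = M X'
L = np.kron(np.eye(p), M)                     # vec(B_hat') = L vec(X')
cov_vecX = np.kron(Sigma, np.eye(N))          # covariance of vec(X')
cov_vecB = L @ cov_vecX @ L.T                 # covariance of vec(B_hat')
```

The product reduces to $\Sigma \otimes A^{-1}ZZ'A^{-1} = \Sigma \otimes A^{-1}$, which is (16).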




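The density then can be written [by virtue of (4)]

(17)   $\dfrac{1}{(2\pi)^{\frac12 pN} |\Sigma|^{\frac12 N}} \exp\left\{ -\tfrac12 \operatorname{tr} \Sigma^{-1} [(\hat{\mathbf{B}} - \mathbf{B})A(\hat{\mathbf{B}} - \mathbf{B})' + N\hat\Sigma] \right\}$.

This proves the following:

Corollary 8.2.1. $\hat{\mathbf{B}}$ and $\hat\Sigma$ form a sufficient set of statistics for $\mathbf{B}$ and $\Sigma$.

A useful theorem is the following.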


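Theorem 8.2.3. Let $X_\alpha$ be distributed according to $N(\mathbf{B}z_\alpha, \Sigma)$, $\alpha = 1, \ldots, N$,
and suppose $X_1, \ldots, X_N$ are independent.

(a) If $w_\alpha = Hz_\alpha$ and $\Gamma = \mathbf{B}H^{-1}$, then $X_\alpha$ is distributed according to
$N(\Gamma w_\alpha, \Sigma)$.

(b) The maximum likelihood estimator of $\Gamma$ based on observations $x_\alpha$ on $X_\alpha$,
$\alpha = 1, \ldots, N$, is $\hat\Gamma = \hat{\mathbf{B}}H^{-1}$, where $\hat{\mathbf{B}}$ is the maximum likelihood
estimator of $\mathbf{B}$.

(c) $\hat\Gamma (\sum_\alpha w_\alpha w_\alpha') \hat\Gamma' = \hat{\mathbf{B}}A\hat{\mathbf{B}}'$, where $A = \sum_\alpha z_\alpha z_\alpha'$, and the maximum
likelihood estimator of $N\Sigma$ is $N\hat\Sigma = \sum_\alpha x_\alpha x_\alpha' - \hat\Gamma (\sum_\alpha w_\alpha w_\alpha') \hat\Gamma' = \sum_\alpha x_\alpha x_\alpha' - \hat{\mathbf{B}}A\hat{\mathbf{B}}'$.

(d) $\hat\Gamma$ and $\hat\Sigma$ are independently distributed.

(e) $\hat\Gamma$ is normally distributed with mean $\Gamma$ and the covariance matrix of the
$i$th and $j$th rows of $\hat\Gamma$ is $\sigma_{ij}(HAH')^{-1} = \sigma_{ij}H'^{-1}A^{-1}H^{-1}$.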


The proof is left to the reader. 
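Parts (b) and (c) of Theorem 8.2.3 can be illustrated numerically. The sketch below fits the regression once on $z_\alpha$ and once on $w_\alpha = Hz_\alpha$; the helper name `fit`, the simulated data, and the particular nonsingular `H` are assumptions for the example.

```python
import numpy as np

def fit(X, Z):
    """Return the estimator C A^{-1} and A = Z Z' for p x N data X and q x N design Z."""
    A = Z @ Z.T
    B_hat = np.linalg.solve(A, Z @ X.T).T
    return B_hat, A

rng = np.random.default_rng(5)
p, q, N = 2, 3, 15
Z = rng.standard_normal((q, N))
X = rng.standard_normal((p, N))
H = np.array([[1.0, 0.0, 0.0],
              [2.0, 1.0, 0.0],
              [-1.0, 0.5, 3.0]])              # nonsingular transformation

B_hat, A = fit(X, Z)
W = H @ Z                                     # w_alpha = H z_alpha
G_hat, A_w = fit(X, W)                        # estimator of Gamma = B H^{-1}
```

Since the fitted values are unchanged by the reparametrization, the residual sum of products, and hence $N\hat\Sigma$, is the same in both parametrizations.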
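An estimator $F$ is a linear estimator of $\beta_{ig}$ if $F = \sum_{\alpha=1}^N f_\alpha' x_\alpha$. It is a linear
unbiased estimator of $\beta_{ig}$ if

(18)   $\beta_{ig} = \mathcal{E}F = \mathcal{E} \sum_{\alpha=1}^N f_\alpha' X_\alpha = \sum_{\alpha=1}^N f_\alpha' \mathbf{B} z_\alpha = \sum_{\alpha=1}^N \sum_{j=1}^p \sum_{h=1}^q f_{j\alpha} \beta_{jh} z_{h\alpha}$

is an identity in $\mathbf{B}$, that is, if

(19)   $\sum_{\alpha=1}^N f_{j\alpha} z_{h\alpha} = 1$,   $j = i$, $h = g$,
$= 0$,   otherwise.

A linear unbiased estimator is best if it has minimum variance over all linear
unbiased estimators; that is, if $\mathcal{E}(F - \beta_{ig})^2 \le \mathcal{E}(G - \beta_{ig})^2$ for $G = \sum_{\alpha=1}^N g_\alpha' x_\alpha$
and $\mathcal{E}G = \beta_{ig}$.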




Theorem 8.2.4. The least squares estimator is the best linear unbiased estimator of $\beta_{ig}$.


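Proof. Let $\tilde\beta_{ig} = \sum_{\alpha=1}^N \sum_{j=1}^p f_{j\alpha}x_{j\alpha}$ be an arbitrary unbiased estimator of $\beta_{ig}$, and let $\hat\beta_{ig} = \sum_{\alpha=1}^N \sum_{h=1}^q x_{i\alpha}z_{h\alpha}a^{hg}$ be the least squares estimator, where $A = \sum_{\alpha=1}^N z_\alpha z_\alpha'$ (and $a^{hg}$ is the $h,g$th element of $A^{-1}$). Then

(20) $\mathcal{E}(\tilde\beta_{ig} - \beta_{ig})^2 = \mathcal{E}[\tilde\beta_{ig} - \hat\beta_{ig} + (\hat\beta_{ig} - \beta_{ig})]^2 = \mathcal{E}(\tilde\beta_{ig} - \hat\beta_{ig})^2 + 2\mathcal{E}(\tilde\beta_{ig} - \hat\beta_{ig})(\hat\beta_{ig} - \beta_{ig}) + \mathcal{E}(\hat\beta_{ig} - \beta_{ig})^2$.

Because $\tilde\beta_{ig}$ and $\hat\beta_{ig}$ are unbiased, $\hat\beta_{ig} - \beta_{ig} = \sum_{\alpha=1}^N \sum_{h=1}^q u_{i\alpha}z_{h\alpha}a^{hg}$, $\tilde\beta_{ig} - \beta_{ig} = \sum_{\alpha=1}^N \sum_{j=1}^p f_{j\alpha}u_{j\alpha}$, where $u_\alpha = x_\alpha - Bz_\alpha$, and

(21) $\tilde\beta_{ig} - \hat\beta_{ig} = \sum_{\alpha=1}^N \sum_{j=1}^p \left(f_{j\alpha} - \delta_{ij}\sum_{h=1}^q z_{h\alpha}a^{hg}\right)u_{j\alpha}$,

where $\delta_{ii} = 1$ and $\delta_{ij} = 0$, $i \ne j$. Then

(22) $\mathcal{E}(\hat\beta_{ig} - \beta_{ig})(\tilde\beta_{ig} - \hat\beta_{ig}) = \sum_{\alpha=1}^N \sum_{h=1}^q \sum_{j=1}^p \sigma_{ij}z_{h\alpha}a^{hg}\left(f_{j\alpha} - \delta_{ij}\sum_{h'=1}^q z_{h'\alpha}a^{h'g}\right) = \sum_{j=1}^p \sigma_{ij}\left[\delta_{ij}a^{gg} - \delta_{ij}\sum_{h,h'=1}^q a^{hg}a_{hh'}a^{h'g}\right] = 0$,

where the second equality uses (19) and $\sum_\alpha z_{h\alpha}z_{h'\alpha} = a_{hh'}$.

Then (20) implies $\mathcal{E}(\tilde\beta_{ig} - \beta_{ig})^2 \ge \mathcal{E}(\hat\beta_{ig} - \beta_{ig})^2$. ∎
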
8.3. LIKELIHOOD RATIO CRITERIA FOR TESTING LINEAR 
HYPOTHESES ABOUT REGRESSION COEFFICIENTS 


8.3.1. Likelihood Ratio Criteria 


Suppose we partition 


(1) $B = (B_1 \;\; B_2)$

so that $B_1$ has $q_1$ columns and $B_2$ has $q_2$ columns. We shall derive the likelihood ratio criterion for testing the hypothesis

(2) $H\colon B_1 = B_1^*$,

where $B_1^*$ is a given matrix. The maximum of the likelihood function $L$ for the sample $x_1,\ldots,x_N$ is

(3) $\max_{B,\Sigma} L = (2\pi)^{-\frac12 pN}|\hat\Sigma_\Omega|^{-\frac12 N}e^{-\frac12 pN}$,

where $\hat\Sigma_\Omega$ is given by (12) or (13) of Section 8.2.
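To find the maximum of the likelihood function for the parameters restricted to $\omega$ defined by (2) we let

(4) $y_\alpha = x_\alpha - B_1^* z_\alpha^{(1)}$, $\alpha = 1,\ldots,N$,

where

(5) $z_\alpha = \begin{pmatrix} z_\alpha^{(1)} \\ z_\alpha^{(2)} \end{pmatrix}$, $\alpha = 1,\ldots,N$,

is partitioned in a manner corresponding to the partitioning of $B$. Then $y_\alpha$ can be considered as an observation from $N(B_2 z_\alpha^{(2)}, \Sigma)$. The estimator of $B_2$ is obtained by the procedure of Section 8.2 as

(6) $\hat B_{2\omega} = \sum_{\alpha=1}^N y_\alpha z_\alpha^{(2)\prime}A_{22}^{-1} = \sum_{\alpha=1}^N (x_\alpha - B_1^* z_\alpha^{(1)})z_\alpha^{(2)\prime}A_{22}^{-1} = (C_2 - B_1^*A_{12})A_{22}^{-1}$

with $C$ and $A$ partitioned in the manner corresponding to the partitioning of $B$ and $z_\alpha$,

(7) $C = (C_1 \;\; C_2)$,

(8) $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$.

The estimator of $\Sigma$ is given by

(9) $N\hat\Sigma_\omega = \sum_{\alpha=1}^N (y_\alpha - \hat B_{2\omega}z_\alpha^{(2)})(y_\alpha - \hat B_{2\omega}z_\alpha^{(2)})' = \sum_{\alpha=1}^N y_\alpha y_\alpha' - \hat B_{2\omega}A_{22}\hat B_{2\omega}' = \sum_{\alpha=1}^N (x_\alpha - B_1^* z_\alpha^{(1)})(x_\alpha - B_1^* z_\alpha^{(1)})' - \hat B_{2\omega}A_{22}\hat B_{2\omega}'$.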


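Thus the maximum of the likelihood function over $\omega$ is

(10) $\max_{B_2,\Sigma} L = (2\pi)^{-\frac12 pN}|\hat\Sigma_\omega|^{-\frac12 N}e^{-\frac12 pN}$.

The likelihood ratio criterion for testing $H$ is (10) divided by (3), namely,

(11) $\lambda = \dfrac{|\hat\Sigma_\Omega|^{\frac12 N}}{|\hat\Sigma_\omega|^{\frac12 N}}$.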


In testing $H$, one rejects the hypothesis if $\lambda < \lambda_0$, where $\lambda_0$ is a suitably chosen number.

A special case of this problem led to Hotelling's $T^2$-criterion. If $q = q_1 = 1$ ($q_2 = 0$), $z_\alpha = 1$, $\alpha = 1,\ldots,N$, and $B = B_1 = \mu$, then the $T^2$-criterion for testing the hypothesis $\mu = \mu_0$ is a monotonic function of (11) for $B_1^* = \mu_0$.

The hypothesis $\mu = 0$ and the $T^2$-statistic are invariant with respect to the transformations $X^* = DX$ and $x_\alpha^* = Dx_\alpha$, $\alpha = 1,\ldots,N$, for nonsingular $D$. Similarly, in this problem the null hypothesis $B_1 = 0$ and the likelihood ratio criterion for testing it are invariant with respect to nonsingular linear transformations.


Theorem 8.3.1. The likelihood ratio criterion (11) for testing the null hypothesis $B_1 = 0$ is invariant with respect to transformations $x_\alpha^* = Dx_\alpha$, $\alpha = 1,\ldots,N$, for nonsingular $D$.

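Proof. The estimators in terms of $x_\alpha^*$ are

(12) $\hat B^* = DCA^{-1} = D\hat B$,

(13) $\hat\Sigma_\Omega^* = \dfrac1N \sum_{\alpha=1}^N (Dx_\alpha - D\hat Bz_\alpha)(Dx_\alpha - D\hat Bz_\alpha)' = D\hat\Sigma_\Omega D'$,

(14) $\hat B_{2\omega}^* = DC_2A_{22}^{-1} = D\hat B_{2\omega}$,

(15) $\hat\Sigma_\omega^* = \dfrac1N \sum_{\alpha=1}^N (Dx_\alpha - D\hat B_{2\omega}z_\alpha^{(2)})(Dx_\alpha - D\hat B_{2\omega}z_\alpha^{(2)})' = D\hat\Sigma_\omega D'$. ∎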
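The criterion (11) and the invariance stated in Theorem 8.3.1 can be checked numerically. The following is a minimal sketch, assuming NumPy and column-wise storage of observations; the helper names `sigma_hat` and `lr_criterion`, the simulated data, and the particular matrix `D` are hypothetical.

```python
import numpy as np

def sigma_hat(X, Z):
    """ML estimate of Sigma from regressing the columns of X (p x N) on Z (k x N)."""
    B_hat = np.linalg.solve(Z @ Z.T, Z @ X.T).T   # X Z' (Z Z')^{-1}
    R = X - B_hat @ Z
    return R @ R.T / X.shape[1]

def lr_criterion(X, Z1, Z2, B1_star):
    """Return (lam, U): U = |Sigma_Omega| / |Sigma_omega| and lam = U**(N/2) as in (11)."""
    N = X.shape[1]
    Sig_Omega = sigma_hat(X, np.vstack([Z1, Z2]))    # unrestricted fit
    Sig_omega = sigma_hat(X - B1_star @ Z1, Z2)      # fit under H: B_1 = B_1^*
    U = np.linalg.det(Sig_Omega) / np.linalg.det(Sig_omega)
    return U ** (N / 2), U

# Hypothetical data: p = 3 responses, q1 = 2 and q2 = 2 regressors, N = 40.
rng = np.random.default_rng(0)
p, q1, q2, N = 3, 2, 2, 40
Z1, Z2 = rng.normal(size=(q1, N)), rng.normal(size=(q2, N))
X = rng.normal(size=(p, N))
B1_star = np.zeros((p, q1))
D = rng.normal(size=(p, p)) + 3.0 * np.eye(p)        # some nonsingular D

lam, U = lr_criterion(X, Z1, Z2, B1_star)
lam_D, U_D = lr_criterion(D @ X, Z1, Z2, B1_star)    # transformed observations
```

Under the transformation both determinants are multiplied by $|D|^2$, so `U` and `U_D` agree up to rounding error.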


8.3.2. Geometric Interpretation 


An insight into the algebra developed here can be given in terms of a 
geometric interpretation. It will be convenient to use the following lemma: 


Lemma 8.3.1.

(16) $\hat B_{2\omega} - \hat B_{2\Omega} = (\hat B_{1\Omega} - B_1^*)A_{12}A_{22}^{-1}$.

Proof. The normal equation $\hat B_\Omega A = C$ is written in partitioned form

(17) $(\hat B_{1\Omega}A_{11} + \hat B_{2\Omega}A_{21},\; \hat B_{1\Omega}A_{12} + \hat B_{2\Omega}A_{22}) = (C_1, C_2)$.

Thus $\hat B_{2\Omega} = C_2A_{22}^{-1} - \hat B_{1\Omega}A_{12}A_{22}^{-1}$. The lemma follows by comparison with (6). ∎

We can now write

(18) $X - BZ = (X - \hat B_\Omega Z) + (\hat B_{2\Omega} - B_2)Z_2 + (\hat B_{1\Omega} - B_1^*)Z_1$
$= (X - \hat B_\Omega Z) + (\hat B_{2\omega} - B_2)Z_2 - (\hat B_{2\omega} - \hat B_{2\Omega})Z_2 + (\hat B_{1\Omega} - B_1^*)Z_1$
$= (X - \hat B_\Omega Z) + (\hat B_{2\omega} - B_2)Z_2 + (\hat B_{1\Omega} - B_1^*)(Z_1 - A_{12}A_{22}^{-1}Z_2)$

as an identity; here $X = (x_1,\ldots,x_N)$, $Z_1 = (z_1^{(1)},\ldots,z_N^{(1)})$, and $Z_2 = (z_1^{(2)},\ldots,z_N^{(2)})$. The rows of $Z = (Z_1', Z_2')'$ span a $q$-dimensional subspace in $N$-space. Each row of $BZ$ is a vector in the $q$-space, and hence each row of $X - BZ$ is a vector from a vector in the $q$-space to the corresponding row vector of $X$. Each row vector of $X - BZ$ is expressed above as the sum of three row vectors. The first matrix on the right of (18) has as its $i$th row a vector orthogonal to the $q$-space and leading to the $i$th row vector of $X$ (as shown in the preceding section). The row vectors of $(\hat B_{2\omega} - B_2)Z_2$ are vectors in the $q_2$-space spanned by the rows of $Z_2$ (since they are linear combinations of the rows of $Z_2$). The row vectors of $(\hat B_{1\Omega} - B_1^*)(Z_1 - A_{12}A_{22}^{-1}Z_2)$ are vectors in the $q_1$-space of $Z_1 - A_{12}A_{22}^{-1}Z_2$, and this space is in the $q$-space of $Z$, but orthogonal to the $q_2$-space of $Z_2$ [since $(Z_1 - A_{12}A_{22}^{-1}Z_2)Z_2' = 0$]. Thus each row of $X - BZ$ is indicated in Figure 8.1 as the sum of three orthogonal vectors: one vector is in the space orthogonal to $Z$, one is in the space of $Z_2$, and one is in the subspace of $Z$ that is orthogonal to $Z_2$.


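Figure 8.1. [Diagram: a row of $X - BZ$ drawn as the sum of three orthogonal vectors relative to the directions $Z_1$ and $Z_2$, with the components $(\hat B_{1\Omega} - B_1^*)(Z_1 - A_{12}A_{22}^{-1}Z_2)$ and $(\hat B_{2\omega} - B_2)Z_2$ labeled.]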




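From the orthogonality relations we have

(19) $(X - BZ)(X - BZ)'$
$= (X - \hat B_\Omega Z)(X - \hat B_\Omega Z)' + (\hat B_{2\omega} - B_2)Z_2Z_2'(\hat B_{2\omega} - B_2)'$
$\quad + (\hat B_{1\Omega} - B_1^*)(Z_1 - A_{12}A_{22}^{-1}Z_2)(Z_1 - A_{12}A_{22}^{-1}Z_2)'(\hat B_{1\Omega} - B_1^*)'$
$= N\hat\Sigma_\Omega + (\hat B_{2\omega} - B_2)A_{22}(\hat B_{2\omega} - B_2)'$
$\quad + (\hat B_{1\Omega} - B_1^*)(A_{11} - A_{12}A_{22}^{-1}A_{21})(\hat B_{1\Omega} - B_1^*)'$.

If we subtract $(\hat B_{2\omega} - B_2)Z_2$ from both sides of (18), we have

(20) $X - B_1^*Z_1 - \hat B_{2\omega}Z_2 = (X - \hat B_\Omega Z) + (\hat B_{1\Omega} - B_1^*)(Z_1 - A_{12}A_{22}^{-1}Z_2)$.

From this we obtain

(21) $N\hat\Sigma_\omega = (X - B_1^*Z_1 - \hat B_{2\omega}Z_2)(X - B_1^*Z_1 - \hat B_{2\omega}Z_2)'$
$= (X - \hat B_\Omega Z)(X - \hat B_\Omega Z)'$
$\quad + (\hat B_{1\Omega} - B_1^*)(Z_1 - A_{12}A_{22}^{-1}Z_2)(Z_1 - A_{12}A_{22}^{-1}Z_2)'(\hat B_{1\Omega} - B_1^*)'$
$= N\hat\Sigma_\Omega + (\hat B_{1\Omega} - B_1^*)(A_{11} - A_{12}A_{22}^{-1}A_{21})(\hat B_{1\Omega} - B_1^*)'$.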
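Lemma 8.3.1 and the decomposition (21) are algebraic identities, so they can be verified on arbitrary data. The sketch below assumes NumPy; all variable names and the random inputs are hypothetical, and the two "gap" arrays should vanish up to rounding error.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q1, q2, N = 2, 2, 3, 30
Z1, Z2 = rng.normal(size=(q1, N)), rng.normal(size=(q2, N))
Z = np.vstack([Z1, Z2])
X = rng.normal(size=(p, N))
B1_star = rng.normal(size=(p, q1))          # hypothesized value of B_1

A = Z @ Z.T
A11, A12 = A[:q1, :q1], A[:q1, q1:]
A21, A22 = A[q1:, :q1], A[q1:, q1:]
C = X @ Z.T

B_Omega = C @ np.linalg.inv(A)              # unrestricted estimator
B1_Omega, B2_Omega = B_Omega[:, :q1], B_Omega[:, q1:]
B2_omega = (C[:, q1:] - B1_star @ A12) @ np.linalg.inv(A22)   # equation (6)

R_Omega = X - B_Omega @ Z                   # residuals, unrestricted
R_omega = X - B1_star @ Z1 - B2_omega @ Z2  # residuals under H
A11_2 = A11 - A12 @ np.linalg.inv(A22) @ A21
Delta = B1_Omega - B1_star

# Lemma 8.3.1: B2_omega - B2_Omega = (B1_Omega - B1*) A12 A22^{-1}
lemma_gap = (B2_omega - B2_Omega) - Delta @ A12 @ np.linalg.inv(A22)
# Equation (21): N Sigma_omega = N Sigma_Omega + Delta A_{11.2} Delta'
eq21_gap = R_omega @ R_omega.T - (R_Omega @ R_Omega.T + Delta @ A11_2 @ Delta.T)
```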


The determinant $|\hat\Sigma_\Omega| = (1/N^p)|(X - \hat B_\Omega Z)(X - \hat B_\Omega Z)'|$ is proportional to the volume squared of the parallelotope spanned by the row vectors of $X - \hat B_\Omega Z$ (translated to the origin). The determinant $|\hat\Sigma_\omega| = (1/N^p)|(X - B_1^*Z_1 - \hat B_{2\omega}Z_2)(X - B_1^*Z_1 - \hat B_{2\omega}Z_2)'|$ is proportional to the volume squared of the parallelotope spanned by the row vectors of $X - B_1^*Z_1 - \hat B_{2\omega}Z_2$ (translated to the origin); each of these vectors is the part of the vector of $X - B_1^*Z_1$ that is orthogonal to $Z_2$. Thus the test based on the likelihood ratio criterion depends on the ratio of volumes of parallelotopes. One parallelotope involves vectors orthogonal to $Z$, and the other involves vectors orthogonal to $Z_2$.


From (19) we see that the density of $x_1,\ldots,x_N$ can be written as

(22) $\dfrac{1}{(2\pi)^{\frac12 pN}|\Sigma|^{\frac12 N}} \exp\{-\tfrac12 \operatorname{tr}\Sigma^{-1}[N\hat\Sigma_\Omega + (\hat B_{2\omega} - B_2)A_{22}(\hat B_{2\omega} - B_2)' + (\hat B_{1\Omega} - B_1^*)(A_{11} - A_{12}A_{22}^{-1}A_{21})(\hat B_{1\Omega} - B_1^*)']\}$.

Thus $\hat\Sigma_\Omega$, $\hat B_{1\Omega}$, and $\hat B_{2\omega}$ form a sufficient set of statistics for $\Sigma$, $B_1$, and $B_2$.




Wilks (1932) first gave the likelihood ratio criterion for testing the equality of mean vectors from several populations (Section 8.8); Wilks (1934) and Bartlett (1934) extended its use to regression coefficients.


8.3.3. The Canonical Form


In studying the distributions of criteria it will be convenient to put the distribution of the observations in canonical form. This amounts to picking a coordinate system in the $N$-dimensional space so that the first $q_1$ coordinate axes are in the space of $Z$ that is orthogonal to $Z_2$, the next $q_2$ coordinate axes are in the space of $Z_2$, and the last $n$ ($= N - q$) coordinate axes are orthogonal to the $Z$-space.

Let $P_2$ be a $q_2 \times q_2$ matrix such that

(23) $I = P_2A_{22}P_2' = (P_2Z_2)(P_2Z_2)'$,

and let $P_1$ be a $q_1 \times q_1$ matrix such that ($A_{11\cdot2} = A_{11} - A_{12}A_{22}^{-1}A_{21}$)

(24) $I = P_1A_{11\cdot2}P_1' = [P_1(Z_1 - A_{12}A_{22}^{-1}Z_2)][P_1(Z_1 - A_{12}A_{22}^{-1}Z_2)]'$.

Then define the $N \times N$ orthogonal matrix $Q$ as

(25) $Q = \begin{pmatrix} Q_1 \\ Q_2 \\ Q_3 \end{pmatrix} = \begin{pmatrix} P_1(Z_1 - A_{12}A_{22}^{-1}Z_2) \\ P_2Z_2 \\ Q_3 \end{pmatrix}$,

where $Q_3$ is any $n \times N$ matrix making $Q$ orthogonal. Then the columns of

(26) $W = (W_1 \;\; W_2 \;\; W_3) = XQ' = X(Q_1' \;\; Q_2' \;\; Q_3')$

are independently normally distributed with covariance matrix $\Sigma$ (Theorem 3.3.1). Then

(27) $\mathcal{E}W_1 = \mathcal{E}XQ_1' = (B_1Z_1 + B_2Z_2)(Z_1 - A_{12}A_{22}^{-1}Z_2)'P_1' = B_1A_{11\cdot2}P_1' = B_1P_1^{-1}$,

(28) $\mathcal{E}W_2 = \mathcal{E}XQ_2' = (B_1Z_1 + B_2Z_2)Z_2'P_2' = (B_1A_{12} + B_2A_{22})P_2'$,

(29) $\mathcal{E}W_3 = \mathcal{E}XQ_3' = BZQ_3' = 0$.
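One way to construct matrices $P_1$, $P_2$, and $Q$ satisfying (23) to (25) is sketched below, assuming NumPy. The choices made here (inverse Cholesky factors for $P_1$ and $P_2$, and an SVD-based basis for $Q_3$) are illustrative assumptions; the text only requires that such matrices exist, and any other valid choice would serve.

```python
import numpy as np

rng = np.random.default_rng(2)
q1, q2, N = 2, 3, 12
Z1, Z2 = rng.normal(size=(q1, N)), rng.normal(size=(q2, N))
A11, A12, A22 = Z1 @ Z1.T, Z1 @ Z2.T, Z2 @ Z2.T

Z1_perp = Z1 - A12 @ np.linalg.solve(A22, Z2)   # part of Z1 orthogonal to Z2
A11_2 = Z1_perp @ Z1_perp.T                     # equals A11 - A12 A22^{-1} A21

# If A = L L' (Cholesky), then P = L^{-1} satisfies P A P' = I.
P1 = np.linalg.inv(np.linalg.cholesky(A11_2))
P2 = np.linalg.inv(np.linalg.cholesky(A22))
Q1, Q2 = P1 @ Z1_perp, P2 @ Z2

# Q3: an orthonormal basis of the orthogonal complement of the row space of Z,
# read off from the trailing right singular vectors.
_, _, Vt = np.linalg.svd(np.vstack([Q1, Q2]))
Q3 = Vt[q1 + q2:]
Q = np.vstack([Q1, Q2, Q3])                     # N x N, orthogonal
```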




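Let

(30) $\Gamma_1 = (\gamma_1,\ldots,\gamma_{q_1}) = B_1A_{11\cdot2}P_1' = B_1P_1^{-1}$,

(31) $\Gamma_2 = (\gamma_{q_1+1},\ldots,\gamma_q) = (B_1A_{12} + B_2A_{22})P_2'$,

(32) $W = (W_1 \;\; W_2 \;\; W_3) = (w_1,\ldots,w_{q_1}, w_{q_1+1},\ldots,w_q, w_{q+1},\ldots,w_N)$.

Then $w_1,\ldots,w_N$ are independently normally distributed with covariance matrix $\Sigma$ and $\mathcal{E}w_\alpha = \gamma_\alpha$, $\alpha = 1,\ldots,q$, and $\mathcal{E}w_\alpha = 0$, $\alpha = q+1,\ldots,N$.

The hypothesis $B_1 = B_1^*$ can be transformed to $B_1 = 0$ by subtraction, that is, by letting $x_\alpha - B_1^*z_\alpha^{(1)} = y_\alpha$, as in Section 8.3.1. In canonical form then, the hypothesis is $\Gamma_1 = 0$. We can study problems in the canonical form, if we wish, and transform solutions back to terms of $X$ and $Z$.

In (17), which is the partitioned form of $\hat B_\Omega A = C$, eliminate $\hat B_{2\Omega}$ to obtain

(33) $\hat B_{1\Omega}(A_{11} - A_{12}A_{22}^{-1}A_{21}) = C_1 - C_2A_{22}^{-1}A_{21} = X(Z_1' - Z_2'A_{22}^{-1}A_{21}) = W_1P_1'^{-1}$;

that is, $W_1 = \hat B_{1\Omega}A_{11\cdot2}P_1' = \hat B_{1\Omega}P_1^{-1}$ and $\Gamma_1 = B_1P_1^{-1}$. Similarly, from (6) we obtain

(34) $\hat B_{2\omega}A_{22} + B_1^*A_{12} = C_2 = XZ_2' = W_2P_2'^{-1}$;

that is, $W_2 = (\hat B_{2\omega}A_{22} + B_1^*A_{12})P_2' = \hat B_{2\omega}P_2^{-1} + B_1^*A_{12}P_2'$ and $\Gamma_2 = B_2P_2^{-1} + B_1A_{12}P_2'$.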


8.4. THE DISTRIBUTION OF THE LIKELIHOOD RATIO CRITERION 
WHEN THE HYPOTHESIS IS TRUE 


8.4.1. Characterization of the Distribution 


The likelihood ratio criterion is the $\frac12 N$th power of

(1) $U = \lambda^{2/N} = \dfrac{|N\hat\Sigma_\Omega|}{|N\hat\Sigma_\Omega + (\hat B_{1\Omega} - B_1^*)A_{11\cdot2}(\hat B_{1\Omega} - B_1^*)'|}$,

where $A_{11\cdot2} = A_{11} - A_{12}A_{22}^{-1}A_{21}$. We shall study the distribution and the moments of $U$ when $B_1 = B_1^*$. It has been shown in Section 8.2 that $N\hat\Sigma_\Omega$ is distributed according to $W(\Sigma, n)$, where $n = N - q$, and the elements of $\hat B_\Omega - B$ have a joint normal distribution independent of $N\hat\Sigma_\Omega$.


From (33) of Section 8.3, we have

(2) $(\hat B_{1\Omega} - B_1^*)A_{11\cdot2}(\hat B_{1\Omega} - B_1^*)' = (W_1 - \Gamma_1)P_1A_{11\cdot2}P_1'(W_1 - \Gamma_1)' = (W_1 - \Gamma_1)(W_1 - \Gamma_1)'$,

by (24) of Section 8.3; the columns of $W_1 - \Gamma_1$ are independently distributed, each according to $N(0, \Sigma)$.


Lemma 8.4.1. $(\hat B_{1\Omega} - B_1^*)A_{11\cdot2}(\hat B_{1\Omega} - B_1^*)'$ is distributed according to $W(\Sigma, q_1)$.


Lemma 8.4.2. The criterion $U$ has the distribution of

(3) $U = \dfrac{|G|}{|G + H|}$,

where $G$ is distributed according to $W(\Sigma, n)$, $H$ is distributed according to $W(\Sigma, m)$, where $m = q_1$, and $G$ and $H$ are independent.


Let

(4) $G = N\hat\Sigma_\Omega = XX' - XZ'(ZZ')^{-1}ZX'$,

(5) $G + H = N\hat\Sigma_\Omega + (\hat B_{1\Omega} - B_1^*)A_{11\cdot2}(\hat B_{1\Omega} - B_1^*)' = N\hat\Sigma_\omega = YY' - YZ_2'(Z_2Z_2')^{-1}Z_2Y'$,

where $Y = X - B_1^*Z_1 = X - (B_1^* \;\; 0)Z$. Then

(6) $G = YY' - YZ'(ZZ')^{-1}ZY'$.

We shall denote this criterion as $U_{p,m,n}$, where $p$ is the dimensionality, $m = q_1$ is the number of columns of $B_1$, and $n = N - q$ is the number of degrees of freedom of $G$.

We now proceed to characterize the distribution of $U$ as the product of beta variables (Section 5.2). Write the criterion $U$ as

(7) $U = \prod_{i=1}^p V_i$,

where $V_1 = g_{11}/(g_{11} + h_{11})$,

(8) $V_i = \dfrac{|G_i|}{|G_{i-1}|} \Big/ \dfrac{|G_i + H_i|}{|G_{i-1} + H_{i-1}|}$, $i = 2,\ldots,p$,



and $G_i$ and $H_i$ are the submatrices of $G$ and $H$, respectively, of the first $i$ rows and columns. Correspondingly, let $y_\alpha^{(i)}$ consist of the first $i$ components of $y_\alpha = x_\alpha - B_1^*z_\alpha^{(1)}$, $\alpha = 1,\ldots,N$. We shall show that $V_i$ is the length squared of the vector from $y_i^* = (y_{i1},\ldots,y_{iN})$ to its projection on $Z$ and $Y_{i-1} = (y_1^{(i-1)},\ldots,y_N^{(i-1)})$ divided by the length squared of the vector from $y_i^*$ to its projection on $Z_2$ and $Y_{i-1}$.


Lemma 8.4.3. Let $y$ be an $N$-component row vector and $U$ an $r \times N$ matrix. Then the sum of squares of the residuals of $y$ from its regression on $U$ is

(9) $\dfrac{\begin{vmatrix} yy' & yU' \\ Uy' & UU' \end{vmatrix}}{|UU'|}$.


Proof. By Corollary A.3.1 of the Appendix, (9) is $yy' - yU'(UU')^{-1}Uy'$, which is the sum of squares of residuals as indicated in (13) of Section 8.2. ∎


Lemma 8.4.4. $V_i$ defined by (8) is the ratio of the sum of squares of the residuals of $y_{i1},\ldots,y_{iN}$ from their regression on $y_1^{(i-1)},\ldots,y_N^{(i-1)}$ and $Z$ to the sum of squares of residuals of $y_{i1},\ldots,y_{iN}$ from their regression on $y_1^{(i-1)},\ldots,y_N^{(i-1)}$ and $Z_2$.


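Proof. The numerator of $V_i$ can be written [from (13) of Section 8.2], with $Y_i$ denoting the first $i$ rows of $Y$,

(10) $\dfrac{|G_i|}{|G_{i-1}|} = \dfrac{|Y_iY_i' - Y_iZ'(ZZ')^{-1}ZY_i'|}{|Y_{i-1}Y_{i-1}' - Y_{i-1}Z'(ZZ')^{-1}ZY_{i-1}'|} = \dfrac{\begin{vmatrix} Y_iY_i' & Y_iZ' \\ ZY_i' & ZZ' \end{vmatrix}}{\begin{vmatrix} Y_{i-1}Y_{i-1}' & Y_{i-1}Z' \\ ZY_{i-1}' & ZZ' \end{vmatrix}}$

$= \dfrac{\begin{vmatrix} Y_{i-1}Y_{i-1}' & Y_{i-1}y_i^{*\prime} & Y_{i-1}Z' \\ y_i^*Y_{i-1}' & y_i^*y_i^{*\prime} & y_i^*Z' \\ ZY_{i-1}' & Zy_i^{*\prime} & ZZ' \end{vmatrix}}{\begin{vmatrix} Y_{i-1}Y_{i-1}' & Y_{i-1}Z' \\ ZY_{i-1}' & ZZ' \end{vmatrix}}$

$= y_i^*y_i^{*\prime} - y_i^*(Y_{i-1}' \;\; Z')\begin{pmatrix} Y_{i-1}Y_{i-1}' & Y_{i-1}Z' \\ ZY_{i-1}' & ZZ' \end{pmatrix}^{-1}\begin{pmatrix} Y_{i-1} \\ Z \end{pmatrix}y_i^{*\prime}$

by Corollary A.3.1. Application of Lemma 8.4.3 shows that the right-hand side of (10) is the sum of squares of the residuals of $y_i^*$ on $Y_{i-1}$ and $Z$. The denominator is evaluated similarly with $Z$ replaced by $Z_2$. ∎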


The ratio $V_i$ is the $2/N$th power of the likelihood ratio criterion for testing the hypothesis that the regression of $y_i^* = x_i^* - \beta_{i1}^*Z_1$ on $Z_1$ is 0 (in the presence of regression on $Y_{i-1}$ and $Z_2$); here $\beta_{i1}^*$ is the $i$th row of $B_1^*$. For $i = 1$, $g_{11}$ is the sum of squares of the residuals of $y_1^* = (y_{11},\ldots,y_{1N})$ from its regression on $Z$, and $g_{11} + h_{11}$ is the sum of squares of the residuals from $Z_2$. The ratio $V_1 = g_{11}/(g_{11} + h_{11})$, which is appropriate to test the hypothesis that regression of $y_1^*$ on $Z_1$ is 0, is distributed as $\chi_n^2/(\chi_n^2 + \chi_m^2)$ (by Lemma 8.4.2) and has the beta distribution $\beta(v; \frac12 n, \frac12 m)$. (See Section 5.2, for example.) Thus $V_i$ has the beta density

(11) $\beta[v; \tfrac12(n+1-i), \tfrac12 m] = \dfrac{\Gamma[\frac12(n+m+1-i)]}{\Gamma[\frac12(n+1-i)]\,\Gamma(\frac12 m)}\, v^{\frac12(n+1-i)-1}(1-v)^{\frac12 m-1}$

for $0 < v < 1$ and 0 for $v$ outside this interval. Since this distribution does not depend on $Y_{i-1}$, we see that the ratio $V_i$ is independent of $Y_{i-1}$, and hence independent of $V_1,\ldots,V_{i-1}$. Then $V_1,\ldots,V_p$ are independent.


Theorem 8.4.1. The distribution of $U$ defined by (3) is the distribution of the product $\prod_{i=1}^p V_i$, where $V_1,\ldots,V_p$ are independent and $V_i$ has the density (11).


The cdf of $U$ can be found by integrating the joint density of $V_1,\ldots,V_p$ over the range

(12) $\prod_{i=1}^p V_i \le u$.




We shall now show that for given $N - q_2$ the indices $p$ and $q_1$ can be interchanged; that is, the distributions of $U_{p,q_1,N-q_2-q_1} = U_{p,m,n}$ and of $U_{q_1,p,N-q_2-p} = U_{m,p,n+m-p}$ are the same. The joint density of $G$ and $W_1$ defined in Section 8.2 when $\Sigma = I$ and $B_1 = 0$ is

(13) $\dfrac{|G|^{\frac12(n-p-1)}e^{-\frac12\operatorname{tr}G}}{2^{\frac12 pn}\pi^{p(p-1)/4}\prod_{i=1}^p\Gamma[\frac12(n+1-i)]} \cdot \dfrac{e^{-\frac12\operatorname{tr}W_1W_1'}}{(2\pi)^{\frac12 pm}}$.

Let $G + W_1W_1' = J = CC'$ and let $W_1 = CU$. Then

(14) $U_{p,m,n} = \dfrac{|G|}{|G + W_1W_1'|} = \dfrac{|CC' - CUU'C'|}{|CC'|} = |I_p - UU'| = \begin{vmatrix} I_p & U \\ U' & I_m \end{vmatrix} = \begin{vmatrix} I_m & U' \\ U & I_p \end{vmatrix} = |I_m - U'U|$;

the fourth and sixth equalities follow from Theorem A.3.2 of the Appendix, and the fifth from permutation of rows and columns. Since the Jacobian of $W_1 = CU$ is $\operatorname{mod}|C|^m = |J|^{\frac12 m}$, the joint density of $J$ and $U$ is

(15) $\dfrac{|J|^{\frac12(n+m-p-1)}e^{-\frac12\operatorname{tr}J}}{2^{\frac12 p(n+m)}\pi^{p(p-1)/4}\prod_{i=1}^p\Gamma[\frac12(n+m+1-i)]} \cdot \left\{\prod_{i=1}^p \dfrac{\Gamma[\frac12(n+m+1-i)]}{\Gamma[\frac12(n+1-i)]}\right\}\dfrac{|I_p - UU'|^{\frac12(n-p-1)}}{\pi^{\frac12 pm}}$

for $J$ and $I_p - UU'$ positive definite, and 0 otherwise. Thus $J$ and $U$ are independently distributed; the density of $J$ is the first term in (15), namely, $w(J \mid I_p, n+m)$, and the density of $U$ is the second term, namely, of the form

(16) $K|I_p - UU'|^{\frac12(n-p-1)}$

for $I_p - UU'$ positive definite, and 0 otherwise. Let $U_* = U'$, $p^* = m$, $m^* = p$, and $n^* = n + m - p$. Then the density of $U_*$ is

(17) $K|I_p - U_*'U_*|^{\frac12(n-p-1)}$

for $I_p - U_*'U_*$ positive definite, and 0 otherwise. By (14), $|I_p - U_*'U_*| = |I_m - U_*U_*'|$, and hence the density of $U_*$ is

(18) $K|I_{p^*} - U_*U_*'|^{\frac12(n^*-p^*-1)}$,

which is of the form of (16) with $p$ replaced by $p^* = m$, $m$ replaced by $m^* = p$, and $n - p - 1$ replaced by $n^* - p^* - 1 = n - p - 1$. Finally we note that $U_{p,m,n}$ given by (14) is $|I_m - U_*U_*'| = U_{m,p,n+m-p}$.

Theorem 8.4.2. When the hypothesis is true, the distribution of $U_{p,q_1,N-q_1-q_2}$ is the same as that of $U_{q_1,p,N-p-q_2}$ (i.e., that of $U_{p,m,n}$ is that of $U_{m,p,n+m-p}$).


8.4.2. Moments 
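Since (11) is a density and hence integrates to 1, by change of notation

(19) $\int_0^1 v^{a-1}(1-v)^{b-1}\,dv = B(a,b) = \dfrac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$.

From this fact we see that the $h$th moment of $V_i$ is

(20) $\mathcal{E}V_i^h = \int_0^1 \dfrac{\Gamma[\frac12(n+m+1-i)]}{\Gamma[\frac12(n+1-i)]\,\Gamma(\frac12 m)}\, v^{\frac12(n+1-i)+h-1}(1-v)^{\frac12 m-1}\,dv = \dfrac{\Gamma[\frac12(n+1-i)+h]\,\Gamma[\frac12(n+m+1-i)]}{\Gamma[\frac12(n+1-i)]\,\Gamma[\frac12(n+m+1-i)+h]}$.

Since $V_1,\ldots,V_p$ are independent, $\mathcal{E}U^h = \mathcal{E}\prod_{i=1}^p V_i^h = \prod_{i=1}^p \mathcal{E}V_i^h$. We obtain the following theorem:


Theorem 8.4.3. The $h$th moment of $U$ [if $h > -\frac12(n+1-p)$] is

(21) $\mathcal{E}U^h = \prod_{i=1}^p \dfrac{\Gamma[\frac12(n+1-i)+h]\,\Gamma[\frac12(n+m+1-i)]}{\Gamma[\frac12(n+1-i)]\,\Gamma[\frac12(n+m+1-i)+h]} = \prod_{i=1}^p \dfrac{\Gamma[\frac12(N-q+1-i)+h]\,\Gamma[\frac12(N-q_2+1-i)]}{\Gamma[\frac12(N-q+1-i)]\,\Gamma[\frac12(N-q_2+1-i)+h]}$.


In the first expression $p$ can be replaced by $m$, $m$ by $p$, and $n$ by $n + m - p$.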
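The moment formula (21) is easy to evaluate on the log scale. The following sketch uses only the Python standard library; the function name `u_moment` and the numerical arguments are hypothetical. It also illustrates the interchange of indices noted above, since the two calls at the end should return the same value.

```python
from math import lgamma, exp

def u_moment(h, p, m, n):
    """h-th moment of U_{p,m,n} from the first expression in (21)."""
    log_val = 0.0
    for i in range(1, p + 1):
        a = 0.5 * (n + 1 - i)          # first beta parameter of V_i
        b = 0.5 * (n + m + 1 - i)      # sum of the two beta parameters
        log_val += lgamma(a + h) + lgamma(b) - lgamma(a) - lgamma(b + h)
    return exp(log_val)

# Interchanging p and m (with n replaced by n + m - p) leaves the moments unchanged.
moment_a = u_moment(1.5, 2, 3, 10)
moment_b = u_moment(1.5, 3, 2, 11)
```

For $p = 1$ and $h = 1$ the formula reduces to the mean of a beta variable, $n/(n+m)$.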
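Suppose $p$ is even, that is, $p = 2r$. We use the duplication formula

(22) $\Gamma(\alpha + \tfrac12)\Gamma(\alpha + 1) = \dfrac{\sqrt{\pi}\,\Gamma(2\alpha + 1)}{2^{2\alpha}}$.

Then the $h$th moment of $U_{2r,m,n}$ is

(23) $\mathcal{E}U_{2r,m,n}^h = \prod_{j=1}^r \left\{\dfrac{\Gamma[\frac12(m+n+2)-j]\,\Gamma[\frac12(m+n+1)-j]}{\Gamma[\frac12(n+2)-j]\,\Gamma[\frac12(n+1)-j]} \cdot \dfrac{\Gamma[\frac12(n+2)-j+h]\,\Gamma[\frac12(n+1)-j+h]}{\Gamma[\frac12(m+n+2)-j+h]\,\Gamma[\frac12(m+n+1)-j+h]}\right\} = \prod_{j=1}^r \left\{\dfrac{\Gamma(m+n+1-2j)\,\Gamma(n+1-2j+2h)}{\Gamma(n+1-2j)\,\Gamma(m+n+1-2j+2h)}\right\}$.

It is clear from the definition of the beta function that (23) is

(24) $\prod_{j=1}^r \left\{\int_0^1 \dfrac{\Gamma(m+n+1-2j)}{\Gamma(n+1-2j)\,\Gamma(m)}\, y^{(n+1-2j)+2h-1}(1-y)^{m-1}\,dy\right\} = \prod_{j=1}^r \mathcal{E}(Y_j^{2h}) = \mathcal{E}\left(\prod_{j=1}^r Y_j^2\right)^h$,

where the $Y_j$ are independent and $Y_j$ has density $\beta(y; n+1-2j, m)$.

Suppose $p$ is odd; that is, $p = 2s+1$. Then

(25) $\mathcal{E}U_{2s+1,m,n}^h = \mathcal{E}\left(\prod_{i=1}^s Z_i^2 \cdot Z_{s+1}\right)^h$,

where the $Z_i$ are independent and $Z_i$ has density $\beta(z; n+1-2i, m)$ for $i = 1,\ldots,s$ and $Z_{s+1}$ is distributed with density $\beta[z; (n+1-p)/2, m/2]$.


Theorem 8.4.4. $U_{2r,m,n}$ is distributed as $\prod_{i=1}^r Y_i^2$, where $Y_1,\ldots,Y_r$ are independent and $Y_i$ has density $\beta(y; n+1-2i, m)$; $U_{2s+1,m,n}$ is distributed as $\prod_{i=1}^s Z_i^2 Z_{s+1}$, where the $Z_i$, $i = 1,\ldots,s$, are independent and $Z_i$ has density $\beta(z; n+1-2i, m)$, and $Z_{s+1}$ is independently distributed with density $\beta[z; \frac12(n+1-p), \frac12 m]$.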
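Theorem 8.4.4 can be checked through moments: the $h$th moment of the paired representation should agree with the product of the moments of $V_1,\ldots,V_p$. The following sketch uses only the standard library; the function names and the test values of $p$, $m$, $n$, $h$ are hypothetical.

```python
from math import lgamma, exp

def log_beta_moment(a, b, k):
    """log E[Y^k] when Y has the beta density beta(y; a, b)."""
    return lgamma(a + k) + lgamma(a + b) - lgamma(a) - lgamma(a + b + k)

def u_moment_direct(h, p, m, n):
    """E U^h from U = V_1 ... V_p with V_i ~ beta(0.5(n+1-i), 0.5 m)."""
    return exp(sum(log_beta_moment(0.5 * (n + 1 - i), 0.5 * m, h)
                   for i in range(1, p + 1)))

def u_moment_paired(h, p, m, n):
    """E U^h from the representation in Theorem 8.4.4."""
    r, odd = divmod(p, 2)
    # Squared beta(n+1-2j, m) variables: E (Y_j^2)^h = E Y_j^(2h).
    total = sum(log_beta_moment(n + 1 - 2 * j, m, 2 * h) for j in range(1, r + 1))
    if odd:
        # Extra unsquared factor Z_{s+1} ~ beta(0.5(n+1-p), 0.5 m).
        total += log_beta_moment(0.5 * (n + 1 - p), 0.5 * m, h)
    return exp(total)

even_gap = u_moment_direct(0.7, 4, 3, 12) - u_moment_paired(0.7, 4, 3, 12)
odd_gap = u_moment_direct(0.7, 5, 3, 12) - u_moment_paired(0.7, 5, 3, 12)
```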


8.4.3. Some Special Distributions 


p = 1

From the preceding characterization we see that the density of $U_{1,m,n}$ is

(26) $\dfrac{\Gamma[\frac12(n+m)]}{\Gamma(\frac12 n)\,\Gamma(\frac12 m)}\, u^{\frac12 n-1}(1-u)^{\frac12 m-1}$.

Another way of writing $U_{1,m,n}$ is

(27) $U_{1,m,n} = \dfrac{1}{1 + \sum_{i=1}^m Y_i^2/g_{11}} = \dfrac{1}{1 + (m/n)F_{m,n}}$,

where $g_{11}$ is the one element of $G = N\hat\Sigma_\Omega$ and $F_{m,n}$ is an $F$-statistic. Thus

(28) $\dfrac{1 - U_{1,m,n}}{U_{1,m,n}} \cdot \dfrac{n}{m} = F_{m,n}$.

Theorem 8.4.5. The distribution of $[(1 - U_{1,m,n})/U_{1,m,n}] \cdot n/m$ is the $F$-distribution with $m$ and $n$ degrees of freedom; the distribution of $[(1 - U_{p,1,n})/U_{p,1,n}] \cdot (n+1-p)/p$ is the $F$-distribution with $p$ and $n+1-p$ degrees of freedom.


p = 2

From Theorem 8.4.4, we see that the density of $\sqrt{U_{2,m,n}}$ is

(29) $\dfrac{\Gamma(n+m-1)}{\Gamma(n-1)\,\Gamma(m)}\, x^{n-2}(1-x)^{m-1}$,

and thus the density of $U_{2,m,n}$ is

(30) $\dfrac{\Gamma(n+m-1)}{2\,\Gamma(n-1)\,\Gamma(m)}\, u^{\frac12(n-3)}(1-\sqrt{u})^{m-1}$.

From (29) it follows that

(31) $\dfrac{1 - \sqrt{U_{2,m,n}}}{\sqrt{U_{2,m,n}}} \cdot \dfrac{n-1}{m} = F_{2m,2(n-1)}$.


Theorem 8.4.6. The distribution of $[(1 - \sqrt{U_{2,m,n}})/\sqrt{U_{2,m,n}}] \cdot (n-1)/m$ is the $F$-distribution with $2m$ and $2(n-1)$ degrees of freedom; the distribution of $[(1 - \sqrt{U_{p,2,n}})/\sqrt{U_{p,2,n}}] \cdot (n+1-p)/p$ is the $F$-distribution with $2p$ and $2(n+1-p)$ degrees of freedom.
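The exact transformations in Theorems 8.4.5 and 8.4.6 are simple enough to write as helper functions. The sketch below uses only the standard library; the function names are hypothetical, and the resulting statistic would still need to be referred to an $F$ table or routine with the degrees of freedom noted in each docstring.

```python
from math import sqrt

def f_stat_p1(u, m, n):
    """p = 1: statistic with the F-distribution with m and n degrees of freedom."""
    return (1.0 - u) / u * n / m

def f_stat_m1(u, p, n):
    """m = 1: statistic with the F-distribution with p and n + 1 - p degrees of freedom."""
    return (1.0 - u) / u * (n + 1 - p) / p

def f_stat_p2(u, m, n):
    """p = 2: statistic with the F-distribution with 2m and 2(n - 1) degrees of freedom."""
    root = sqrt(u)
    return (1.0 - root) / root * (n - 1) / m

def f_stat_m2(u, p, n):
    """m = 2: statistic with the F-distribution with 2p and 2(n + 1 - p) degrees of freedom."""
    root = sqrt(u)
    return (1.0 - root) / root * (n + 1 - p) / p
```

For example, `f_stat_p1(0.5, 2, 10)` evaluates $(0.5/0.5)\cdot 10/2$, and the $p = 1$ and $m = 1$ forms coincide when $p = m = 1$.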


p Even

Wald and Brookner (1941) gave a method for finding the distribution of $U_{p,m,n}$ for $p$ or $m$ even. We shall present the method of Schatzoff (1966a). It will be convenient first to consider $U_{p,m,n}$ for $m = 2r$. We can write the event $\prod_{i=1}^p V_i \le u$ as

(32) $Y_1 + \cdots + Y_p \ge -\log u$,

where $Y_1,\ldots,Y_p$ are independent and $Y_i = -\log V_i$ has the density

(33) $K_i e^{-\frac12(n+1-i)y}(1 - e^{-y})^{r-1} = K_i \sum_{j=0}^{r-1} (-1)^j \binom{r-1}{j} e^{-[\frac12(n+1-i)+j]y}$

for $0 \le y < \infty$ and 0 otherwise, and

(34) $K_i = \dfrac{\Gamma[\frac12(n+1-i)+r]}{\Gamma[\frac12(n+1-i)]\,\Gamma(r)} = \dfrac{1}{(r-1)!}\prod_{j=0}^{r-1}\left[\tfrac12(n+1-i)+j\right]$.

The joint density of $Y_1,\ldots,Y_p$ is then a linear combination of terms $\exp[-\sum_{i=1}^p a_iy_i]$. The density of $W_j = \sum_{i=1}^j Y_i$ can be obtained inductively from the density of $W_{j-1} = \sum_{i=1}^{j-1} Y_i$ and $Y_j$, $j = 2,\ldots,p$, which is a linear combination of terms $w_{j-1}^k e^{-cw_{j-1} - a_jy_j}$. The density of $W_j$ consists of linear combinations of

(35) $\int_0^{w_j} w_{j-1}^k e^{-cw_{j-1} - a_j(w_j - w_{j-1})}\,dw_{j-1} = e^{-cw_j}\dfrac{w_j^{k+1}}{k+1}$ if $a_j = c$,

$\qquad = e^{-cw_j}\sum_{h=0}^k (-1)^h \dfrac{k!}{(k-h)!}\dfrac{w_j^{k-h}}{(a_j - c)^{h+1}} + (-1)^{k+1}\dfrac{k!\,e^{-a_jw_j}}{(a_j - c)^{k+1}}$ if $a_j \ne c$.

The evaluation involves integration by parts.


Theorem 8.4.7. If $p$ is even or if $m$ is even, the density of $U_{p,m,n}$ can be expressed as a linear combination of terms $(-\log u)^k u^l$, where $k$ is an integer and $l$ is a half integer.


From (35) we see that the cumulative distribution function of $-\log U$ is a linear combination of terms $w^k e^{-lw}$ and hence the cumulative distribution function of $U$ is a linear combination of terms $(-\log u)^k u^l$. The values of $k$ and $l$ and the coefficients depend on $p$, $m$, and $n$. They can be obtained by inductively carrying out the procedure leading to Theorem 8.4.7. Pillai and Gupta (1969) used Theorem 8.4.3 for obtaining distributions.




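An alternative approach is to use Theorem 8.4.4. The complement to the cumulative distribution function of $U_{2r,m,n}$ is

(36) $\Pr\{U_{2r,m,n} \ge u\} = \Pr\left\{\prod_{i=1}^r Y_i \ge \sqrt{u}\right\} = \int\cdots\int_{\prod_i y_i \ge \sqrt{u}} \prod_{i=1}^r \beta(y_i; n+1-2i, m)\,dy_1\cdots dy_r$.

In the density, $(1 - y_i)^{m-1}$ can be expanded by the binomial theorem. Then all integrations are expressed as integrations of powers of the variables.
As an example, consider $r = 2$. The density of $Y_1$ and $Y_2$ is

(37) $c\,y_1^{n-2}y_2^{n-4}(1-y_1)^{m-1}(1-y_2)^{m-1} = c\sum_{i,j=0}^{m-1} \dfrac{[(m-1)!]^2(-1)^{i+j}}{(m-i-1)!\,(m-j-1)!\,i!\,j!}\, y_1^{n-2+i}y_2^{n-4+j}$,

where

(38) $c = \dfrac{\Gamma(n+m-1)\,\Gamma(n+m-3)}{\Gamma(n-1)\,\Gamma(n-3)\,\Gamma^2(m)}$.

The complement to the cdf of $U_{4,m,n}$ is

(39) $\Pr\{U_{4,m,n} \ge u\} = c\sum_{i,j=0}^{m-1} \dfrac{[(m-1)!]^2(-1)^{i+j}}{(m-i-1)!\,(m-j-1)!\,i!\,j!} \int_{\sqrt{u}}^1\int_{\sqrt{u}/y_2}^1 y_1^{n-2+i}y_2^{n-4+j}\,dy_1\,dy_2$

$\qquad = c\sum_{i,j=0}^{m-1} \dfrac{[(m-1)!]^2(-1)^{i+j}}{(m-i-1)!\,(m-j-1)!\,i!\,j!\,(n-1+i)} \int_{\sqrt{u}}^1 \left[y_2^{n-4+j} - (\sqrt{u})^{n-1+i}y_2^{j-i-3}\right]dy_2$.

The last step of the integration yields powers of $\sqrt{u}$ and products of powers of $\sqrt{u}$ and $\log u$ (for $1 + i - j = -1$).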




Particular Values

Wilks (1935) gives explicitly the distributions of $U$ for $p = 1$; $p = 2$; $p = 3$ with $m = 3$; $p = 3$ with $m = 4$; and $p = 4$ with $m = 4$. Wilks's formula for $p = 3$ with $m = 4$ appears to be incorrect; see the first edition of this book. Consul (1966) gives many distributions for special cases. See also Mathai (1971).


8.4.4. The Likelihood Ratio Procedure 


Let $u_{p,m,n}(\alpha)$ be the $\alpha$ significance point for $U_{p,m,n}$; that is,

(40) $\Pr\{U_{p,m,n} \le u_{p,m,n}(\alpha) \mid H \text{ true}\} = \alpha$.

It is shown in Section 8.5 that $-[n - \frac12(p-m+1)]\log U_{p,m,n}$ has a limiting $\chi^2$-distribution with $pm$ degrees of freedom. Let $\chi_{pm}^2(\alpha)$ denote the $\alpha$ significance point of $\chi_{pm}^2$, and let

(41) $C_{p,m,n-p+1}(\alpha) = \dfrac{-[n - \frac12(p-m+1)]\log u_{p,m,n}(\alpha)}{\chi_{pm}^2(\alpha)}$.

Table B.1 [from Pearson and Hartley (1972)] gives values of $C_{p,m,M}(\alpha)$ for $\alpha = 0.10$ and $0.05$, $p = 1(1)10$, various even values of $m$, and $M = n - p + 1 = 1(1)10(2)20, 24, 30, 40, 60, 120$.

To test a null hypothesis one computes $U_{p,m,n}$ and rejects the null hypothesis at significance level $\alpha$ if

(42) $-[n - \tfrac12(p-m+1)]\log U_{p,m,n} > C_{p,m,n-p+1}(\alpha)\,\chi_{pm}^2(\alpha)$.

Since $C_{p,m,n-p+1}(\alpha) > 1$, the hypothesis is accepted if the left-hand side of (42) is less than $\chi_{pm}^2(\alpha)$.

The purpose of tabulating $C_{p,m,M}(\alpha)$ is that linear interpolation is reasonably accurate because the entries decrease monotonically and smoothly to 1 as $M$ increases. Schatzoff (1966a) has recommended interpolation for odd $p$ by using adjacent even values of $p$ and displays some examples. The table also indicates how accurate the $\chi^2$-approximation is. The table has been extended by Pillai and Gupta (1969).
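The decision rule (42) can be sketched as a small function, using only the standard library. The function name `lr_test` is hypothetical; the $\chi^2$ significance point and the correction factor $C$ are assumed to be supplied by the user from tables (such as Table B.1), since no table lookup is implemented here. With the default `C=1.0` the rule is the uncorrected $\chi^2$ approximation.

```python
from math import log

def lr_test(U, p, m, n, chi2_point, C=1.0):
    """Return (statistic, reject) for rule (42).

    chi2_point: upper significance point of chi-square with p*m degrees of
                freedom, taken from a table by the caller.
    C:          tabulated correction factor C_{p,m,n-p+1}(alpha); 1.0 gives the
                plain chi-square approximation.
    """
    stat = -(n - 0.5 * (p - m + 1)) * log(U)
    return stat, stat > C * chi2_point

# Hypothetical inputs: p = m = 2, n = 10, and the 5% point of chi-square with
# 4 degrees of freedom (9.488) read from a standard table.
stat, reject = lr_test(0.35, 2, 2, 10, chi2_point=9.488)
```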


8.4.5. A Step-down Procedure 


The criterion $U$ has been expressed in (7) as the product of independent beta variables $V_1, V_2,\ldots,V_p$. The ratio $V_i$ is a least squares criterion for testing the null hypothesis that in the regression of $x_i^* - \beta_{i1}^*Z_1$ on $Z = (Z_1' \; Z_2')'$ and $X_{i-1}$ the coefficient of $Z_1$ is 0. The null hypothesis that the regression of $X$ on $Z_1$ is $B_1^*$, which is equivalent to the hypothesis that the regression of $X - B_1^*Z_1$ on $Z_1$ is 0, is composed of the hypotheses that the regression of $x_i^* - \beta_{i1}^*Z_1$ on $Z_1$ is 0, $i = 1,\ldots,p$. Hence the null hypothesis $B_1 = B_1^*$ can be tested by use of $V_1,\ldots,V_p$.

Since $V_i$ has the beta density (11) under the hypothesis $\beta_{i1} = \beta_{i1}^*$,

(43) $\dfrac{1 - V_i}{V_i} \cdot \dfrac{n-i+1}{m}$

has the $F$-distribution with $m$ and $n-i+1$ degrees of freedom. The step-down testing procedure is to compare (43) for $i = 1$ with the significance point $F_{m,n}(\varepsilon_1)$; if (43) for $i = 1$ is larger, reject the null hypothesis that the regression of $x_1^* - \beta_{11}^*Z_1$ on $Z_1$ is 0 and hence reject the null hypothesis that $B_1 = B_1^*$. If this first component null hypothesis is accepted, compare (43) for $i = 2$ with $F_{m,n-1}(\varepsilon_2)$. In sequence, the component null hypotheses are tested. If one is rejected, the sequence is stopped and the hypothesis $B_1 = B_1^*$ is rejected. If all component null hypotheses are accepted, the composite hypothesis is accepted. When the hypothesis $B_1 = B_1^*$ is true, the probability of accepting it is $\prod_{i=1}^p (1 - \varepsilon_i)$. Hence the significance level of the step-down test is $1 - \prod_{i=1}^p (1 - \varepsilon_i)$.
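The step-down procedure and its overall significance level can be sketched as follows, using only the standard library. The function names are hypothetical, and the significance points $F_{m,n-i+1}(\varepsilon_i)$ are assumed to be supplied by the caller from tables.

```python
def step_down(V, m, n, f_points):
    """Run the step-down test.

    V[i-1] is the factor V_i and f_points[i-1] is the significance point
    F_{m, n-i+1}(eps_i) supplied by the caller. Returns the index i of the first
    rejected component hypothesis, or None if all are accepted.
    """
    for i, (v, crit) in enumerate(zip(V, f_points), start=1):
        f_value = (1.0 - v) / v * (n - i + 1) / m     # statistic (43)
        if f_value > crit:
            return i
    return None

def overall_level(eps):
    """Significance level 1 - prod(1 - eps_i) of the step-down test."""
    accept_prob = 1.0
    for e in eps:
        accept_prob *= 1.0 - e
    return 1.0 - accept_prob
```

For instance, five component tests each at level 0.01 give an overall level of $1 - 0.99^5$, slightly below 0.05.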

In the step-down procedure the investigator usually has a choice of the ordering of the variables† (i.e., the numbering of the components of $X$) and a selection of component significance levels. It seems reasonable to order the variables in descending order of importance. The choice of significance levels will affect the power. If $\varepsilon_i$ is a very small number, it will take a correspondingly large deviation from the $i$th null hypothesis to lead to rejection. In the absence of any other reason, the component significance levels can be taken equal. This procedure, of course, is not invariant with respect to linear transformation of the dependent vector variable. However, before carrying out a step-down procedure, a linear transformation can be used to determine the $p$ variables.

The factors can be grouped. For example, group $x_1,\ldots,x_k$ into one set and $x_{k+1},\ldots,x_p$ into another set. Then $U_{k,m,n} = \prod_{i=1}^k V_i$ can be used to test the null hypothesis that the first $k$ rows of $B_1$ are the first $k$ rows of $B_1^*$. Subsequently $\prod_{i=k+1}^p V_i$ is used to test the hypothesis that the last $p - k$ rows of $B_1$ are those of $B_1^*$; this latter criterion has the distribution under the null hypothesis of $U_{p-k,m,n-k}$.


†In some cases the ordering of variables may be imposed; for example, $x_1$ might be an observation at the first time point, $x_2$ at the second time point, and so on.




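The investigator may test the null hypothesis $B_1 = B_1^*$ by the likelihood ratio procedure. If the hypothesis is rejected, he may look at the factors $V_1,\ldots,V_p$ to try to determine which rows of $B_1$ might be different from $B_1^*$.

The factors can also be used to obtain confidence regions for $\beta_{11},\ldots,\beta_{p1}$. Let $u_i(\varepsilon_i)$ be defined by

(44) $\dfrac{1 - u_i(\varepsilon_i)}{u_i(\varepsilon_i)} \cdot \dfrac{n-i+1}{m} = F_{m,n-i+1}(\varepsilon_i)$.

Then a confidence region for $\beta_{i1}$ of confidence $1 - \varepsilon_i$ is

(45) $\dfrac{\begin{vmatrix} x_i^*x_i^{*\prime} & x_i^*X_{i-1}' & x_i^*Z' \\ X_{i-1}x_i^{*\prime} & X_{i-1}X_{i-1}' & X_{i-1}Z' \\ Zx_i^{*\prime} & ZX_{i-1}' & ZZ' \end{vmatrix}}{\begin{vmatrix} (x_i^* - \beta_{i1}Z_1)(x_i^* - \beta_{i1}Z_1)' & (x_i^* - \beta_{i1}Z_1)X_{i-1}' & (x_i^* - \beta_{i1}Z_1)Z_2' \\ X_{i-1}(x_i^* - \beta_{i1}Z_1)' & X_{i-1}X_{i-1}' & X_{i-1}Z_2' \\ Z_2(x_i^* - \beta_{i1}Z_1)' & Z_2X_{i-1}' & Z_2Z_2' \end{vmatrix}} \cdot \dfrac{\begin{vmatrix} X_{i-1}X_{i-1}' & X_{i-1}Z_2' \\ Z_2X_{i-1}' & Z_2Z_2' \end{vmatrix}}{\begin{vmatrix} X_{i-1}X_{i-1}' & X_{i-1}Z' \\ ZX_{i-1}' & ZZ' \end{vmatrix}} \ge u_i(\varepsilon_i)$.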


8.5. AN ASYMPTOTIC EXPANSION OF THE DISTRIBUTION 
OF THE LIKELIHOOD RATIO CRITERION 


8.5.1. General Theory of Asymptotic Expansions 


In this section we develop a large-sample distribution theory for the criterion studied in this chapter. First we develop a general asymptotic expansion of the distribution of a random variable whose moments are certain functions of gamma functions [Box (1949)]. Then we apply it to the case of the likelihood ratio criterion for the linear hypothesis.
We consider a random variable $W$ ($0 \le W \le 1$) with $h$th moment†

(1) $\mathcal{E}W^h = K\left(\dfrac{\prod_{j=1}^b y_j^{y_j}}{\prod_{k=1}^a x_k^{x_k}}\right)^h \dfrac{\prod_{k=1}^a \Gamma[x_k(1+h) + \xi_k]}{\prod_{j=1}^b \Gamma[y_j(1+h) + \eta_j]}$, $h = 0, 1, \ldots$,


†In all cases where we apply this result, the parameters $x_k$, $\xi_k$, $y_j$, and $\eta_j$ will be such that there is a distribution with such moments.




where $K$ is a constant such that $\mathcal{E}W^0 = 1$ and

(2) $\sum_{k=1}^a x_k = \sum_{j=1}^b y_j$.

It will be observed that the $h$th moment of $\lambda = U_{p,q_1,n}^{\frac12 N}$ is of this form where $x_k = \frac12 N = y_j$, $\xi_k = \frac12(-q+1-k)$, $\eta_j = \frac12(-q_2+1-j)$, $a = b = p$. We treat a more general case here because applications later in this book require it.

If we let

(3) $M = -2\log W$,

the characteristic function of $\rho M$ ($0 \le \rho < 1$) is

(4) $\phi(t) = \mathcal{E}e^{it\rho M} = \mathcal{E}W^{-2it\rho} = K\left(\dfrac{\prod_{j=1}^b y_j^{y_j}}{\prod_{k=1}^a x_k^{x_k}}\right)^{-2it\rho} \dfrac{\prod_{k=1}^a \Gamma[x_k(1-2it\rho) + \xi_k]}{\prod_{j=1}^b \Gamma[y_j(1-2it\rho) + \eta_j]}$.

Here $\rho$ is arbitrary; later it will depend on $N$. If $a = b$, $x_k = y_k$, $\xi_k \le \eta_k$, then (1) is the $h$th moment of the product of powers of variables with beta distributions, and then (1) holds for all $h$ for which the gamma functions exist. In this case (4) is valid for all real $t$. We shall assume here that (4) holds for all real $t$, and in each case where we apply the result we shall verify this assumption.

Let

(5) $\Phi(t) = \log\phi(t) = g(t) - g(0)$,

where

$g(t) = 2it\rho\left[\sum_{k=1}^a x_k\log x_k - \sum_{j=1}^b y_j\log y_j\right] + \sum_{k=1}^a \log\Gamma[\rho x_k(1-2it) + \beta_k + \xi_k] - \sum_{j=1}^b \log\Gamma[\rho y_j(1-2it) + \varepsilon_j + \eta_j]$,

where $\beta_k = (1-\rho)x_k$ and $\varepsilon_j = (1-\rho)y_j$. The form $g(t) - g(0)$ makes $\Phi(0) = 0$, which agrees with the fact that $K$ is such that $\phi(0) = 1$. We make use of an expansion formula for the gamma function [Barnes (1899), p. 64] which is asymptotic in $x$ for bounded $h$:

(6) $\log\Gamma(x+h) = \log\sqrt{2\pi} + (x + h - \tfrac12)\log x - x - \sum_{r=1}^m \dfrac{(-1)^r B_{r+1}(h)}{r(r+1)x^r} + R_{m+1}(x)$,

where† $R_{m+1}(x) = O(x^{-(m+1)})$ and $B_r(h)$ is the Bernoulli polynomial of degree $r$ and order unity defined by‡

(7) $\dfrac{\tau e^{h\tau}}{e^\tau - 1} = \sum_{r=0}^\infty \dfrac{\tau^r}{r!}B_r(h)$.

The first three polynomials are [$B_0(h) = 1$]

(8) $B_1(h) = h - \tfrac12$, $\quad B_2(h) = h^2 - h + \tfrac16$, $\quad B_3(h) = h^3 - \tfrac32 h^2 + \tfrac12 h$.
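The Bernoulli polynomials in (8) and the expansion (6) can be checked numerically. The sketch below uses only the standard library; it computes $B_r(h)$ from the standard relation $B_r(h) = \sum_k \binom{r}{k} b_k h^{r-k}$, where $b_k$ are the Bernoulli numbers with $b_1 = -\frac12$ (a well-known identity, not taken from the text), and compares the truncated expansion with `math.lgamma`. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb, lgamma, log, pi

def bernoulli_numbers(r_max):
    """Bernoulli numbers b_0..b_{r_max} with b_1 = -1/2, by the usual recurrence."""
    b = [Fraction(1)]
    for m in range(1, r_max + 1):
        b.append(-sum(comb(m + 1, k) * b[k] for k in range(m)) / (m + 1))
    return b

def bernoulli_poly(r, h):
    """Bernoulli polynomial B_r(h) of degree r, as generated by (7)."""
    b = bernoulli_numbers(r)
    return sum(comb(r, k) * b[k] * h ** (r - k) for k in range(r + 1))

def log_gamma_expansion(x, h, m):
    """Right-hand side of (6) without the remainder term R_{m+1}(x)."""
    total = 0.5 * log(2 * pi) + (x + h - 0.5) * log(x) - x
    for r in range(1, m + 1):
        total -= (-1) ** r * float(bernoulli_poly(r + 1, h)) / (r * (r + 1) * x ** r)
    return total

approx = log_gamma_expansion(50.0, 0.3, 2)
exact = lgamma(50.3)
```

Passing a `Fraction` for `h` gives exact polynomial values, for example $B_2(\frac12) = -\frac1{12}$.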


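Taking $x = \rho x_k(1-2it)$, $\rho y_j(1-2it)$ and $h = \beta_k + \xi_k$, $\varepsilon_j + \eta_j$ in turn, we obtain

(9) $\Phi(t) = Q - g(0) - \tfrac12 f\log(1-2it) + \sum_{r=1}^m \omega_r(1-2it)^{-r} + \sum_{k=1}^a O(x_k^{-(m+1)}) + \sum_{j=1}^b O(y_j^{-(m+1)})$,

where

(10) $f = -2\left[\sum_k \xi_k - \sum_j \eta_j - \tfrac12(a-b)\right]$,

(11) $\omega_r = \dfrac{(-1)^{r+1}}{r(r+1)}\left\{\sum_k \dfrac{B_{r+1}(\beta_k + \xi_k)}{(\rho x_k)^r} - \sum_j \dfrac{B_{r+1}(\varepsilon_j + \eta_j)}{(\rho y_j)^r}\right\}$,

(12) $Q = \tfrac12(a-b)\log 2\pi - \tfrac12 f\log\rho + \sum_k (x_k + \xi_k - \tfrac12)\log x_k - \sum_j (y_j + \eta_j - \tfrac12)\log y_j$.


†$R_{m+1}(x) = O(x^{-(m+1)})$ means $|x^{m+1}R_{m+1}(x)|$ is bounded as $|x| \to \infty$.

‡This definition differs slightly from that of Whittaker and Watson [(1943), p. 126], who expand $\tau(e^{h\tau} - 1)/(e^\tau - 1)$. If $B_r^*(h)$ is this second type of polynomial, $B_1(h) = B_1^*(h) - \frac12$, $B_{2r}(h) = B_{2r}^*(h) + (-1)^{r+1}B_r$, where $B_r$ is the $r$th Bernoulli number, and $B_{2r+1}(h) = B_{2r+1}^*(h)$.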




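One resulting form for $\phi(t)$ (which we shall not use here) is

(13) $\phi(t) = e^{\Phi(t)} = e^{Q - g(0)}(1-2it)^{-\frac12 f}\sum_{v=0}^m a_v(1-2it)^{-v} + R_{m+1}^*$,

where $\sum_{v=0}^m a_vz^{-v}$ is the sum of the first $m+1$ terms in the series expansion of $\exp(\sum_{r=1}^m \omega_rz^{-r})$, and $R_{m+1}^*$ is a remainder term. Alternatively,

(14) $\Phi(t) = -\tfrac12 f\log(1-2it) + \sum_{r=1}^m \omega_r[(1-2it)^{-r} - 1] + R_{m+1}'$,

where

(15) $R_{m+1}' = \sum_k O(x_k^{-(m+1)}) + \sum_j O(y_j^{-(m+1)})$.

In (14) we have expanded $g(0)$ in the same way we expanded $g(t)$ and have collected similar terms.
Then

(16) $\phi(t) = e^{\Phi(t)} = (1-2it)^{-\frac12 f}\exp\left[\sum_{r=1}^m \omega_r(1-2it)^{-r} - \sum_{r=1}^m \omega_r + R_{m+1}'\right]$

$\qquad = (1-2it)^{-\frac12 f}\left\{\prod_{r=1}^m \left[1 + \omega_r(1-2it)^{-r} + \tfrac{1}{2!}\omega_r^2(1-2it)^{-2r} + \cdots\right] \times \prod_{r=1}^m \left(1 - \omega_r + \tfrac{1}{2!}\omega_r^2 - \cdots\right) + R_{m+1}''\right\}$

$\qquad = (1-2it)^{-\frac12 f}[1 + T_1(t) + T_2(t) + \cdots + T_m(t) + R_{m+1}''']$,

where $T_r(t)$ is the term in the expansion with terms $\omega_1^{s_1}\cdots\omega_r^{s_r}$, $\sum_i is_i = r$; for example,

(17) $T_1(t) = \omega_1[(1-2it)^{-1} - 1]$,

(18) $T_2(t) = \omega_2[(1-2it)^{-2} - 1] + \tfrac12\omega_1^2[(1-2it)^{-2} - 2(1-2it)^{-1} + 1]$.

In most applications, we will have $x_k = c_k\theta$ and $y_j = d_j\theta$, where $c_k$ and $d_j$ will be constant and $\theta$ will vary (i.e., will grow with the sample size). In this case if $\rho$ is chosen so $(1-\rho)x_k$ and $(1-\rho)y_j$ have limits, then $R_{m+1}'''$ is $O(\theta^{-(m+1)})$. We collect in (16) all terms $\omega_1^{s_1}\cdots\omega_r^{s_r}$, $\sum_i is_i = r$, because these terms are $O(\theta^{-r})$.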


It will be observed that 7;(t) is a polynomial of degree r in (1 — 2it)”' and 
each term of (1 — 2it)~ 7/T\(t) is a constant times (1 — 2it)~ ®” for an integral 


v, We know that (1 — 2it)” ?” is the characteristic function of the y2-density 
with v degrees of freedom; that ts, 


(19) g.(z) = AG 


S(2)= [pei 2in) YT (year, 
(20) - | 
Ra = aC i 2it) 7 R" e'? dt, 


Then the density of pM is 
21) [ge betae= Fo 52) +R, 
=2,(z) + w,[8/.2(2) -8/(2)] 
+ {oa[ geal 2) ~87(2)] 


ae 
+ | Bp.4(2) ~ 28542(Z) +e,(2)]| 


+ +8, (z) +R% 


m+t° 


Za 
U,(2) = f S,(z) dz, 
(22) 
za. 
Rina = Ring a2. 


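The cdf of $M$ is written in terms of the cdf of $\rho M$, which is the integral of the density, namely,


(23) $\Pr\{M \le M_0\}$

$= \Pr\{\rho M \le \rho M_0\}$

$= \sum_{r=0}^{m} U_r(\rho M_0) + R^{v}_{m+1}$

$= \Pr\{\chi_f^2 \le \rho M_0\} + \omega_1\left(\Pr\{\chi_{f+2}^2 \le \rho M_0\} - \Pr\{\chi_f^2 \le \rho M_0\}\right)$

$+ \left[\omega_2\left(\Pr\{\chi_{f+4}^2 \le \rho M_0\} - \Pr\{\chi_f^2 \le \rho M_0\}\right) + \frac{\omega_1^2}{2}\left(\Pr\{\chi_{f+4}^2 \le \rho M_0\} - 2\Pr\{\chi_{f+2}^2 \le \rho M_0\} + \Pr\{\chi_f^2 \le \rho M_0\}\right)\right]$

$+ \cdots + U_m(\rho M_0) + R^{v}_{m+1}$.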


The remainder $R^{v}_{m+1}$ is $O(\theta^{-(m+1)})$; this last statement can be verified by following the remainder terms along. (In fact, to make the proof rigorous one needs to verify that each remainder is of the proper order in a uniform sense.)

In many cases it is desirable to choose $\rho$ so that $\omega_1 = 0$. In such a case using only the first term of (23) gives an error of order $\theta^{-2}$.

Further details of the expansion can be found in Box’s paper (1949). 


Theorem 8.5.1. Suppose that $\mathcal{E}W^h$ is given by (1) for all purely imaginary $h$, with (2) holding. Then the cdf of $-2\rho \log W$ is given by (23). The error, $R^{v}_{m+1}$, is $O(\theta^{-(m+1)})$ if $x_k \ge c_k\theta$, $y_j \ge d_j\theta$ ($c_k > 0$, $d_j > 0$), and if $(1 - \rho)x_k$, $(1 - \rho)y_j$ have limits, where $\rho$ may depend on $\theta$.


Box also considers approximating the distribution of $-2\rho \log W$ by an $F$-distribution. He finds that the error in this approximation can be made to be of order $\theta^{-3}$.


8.5.2. Asymptotic Distribution of the Likelihood Ratio Criterion 


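We now apply Theorem 8.5.1 to the distribution of $-2\log\lambda$, the likelihood ratio criterion developed in Section 8.3. We let $W = \lambda$. The $h$th moment of $\lambda$ is


(24) $\mathcal{E}\lambda^h = K\, \dfrac{\prod_{k=1}^{p} \Gamma\left[\tfrac{1}{2}(N - q + 1 - k + Nh)\right]}{\prod_{k=1}^{p} \Gamma\left[\tfrac{1}{2}(N - q_2 + 1 - k + Nh)\right]}$,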


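and this holds for all $h$ for which the gamma functions exist, including purely imaginary $h$. We let $a = b = p$,


(25) $x_k = \tfrac{1}{2}N$, $\quad \xi_k = \tfrac{1}{2}(-q + 1 - k)$, $\quad \beta_k = \tfrac{1}{2}(1 - \rho)N$,

$y_j = \tfrac{1}{2}N$, $\quad \eta_j = \tfrac{1}{2}(-q_2 + 1 - j)$, $\quad \varepsilon_j = \tfrac{1}{2}(1 - \rho)N$.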


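We observe that


(26) $2\omega_1 = \sum_{k=1}^{p} \dfrac{\left\{\tfrac{1}{2}[(1 - \rho)N - q + 1 - k]\right\}^2 - \tfrac{1}{2}[(1 - \rho)N - q + 1 - k]}{\tfrac{1}{2}\rho N} - \sum_{k=1}^{p} \dfrac{\left\{\tfrac{1}{2}[(1 - \rho)N - q_2 + 1 - k]\right\}^2 - \tfrac{1}{2}[(1 - \rho)N - q_2 + 1 - k]}{\tfrac{1}{2}\rho N}$

$= \dfrac{q_1}{2\rho N} \sum_{k=1}^{p} \left[-2(1 - \rho)N + 2q_2 - 2 + 2k + q_1 + 2\right]$

$= \dfrac{p q_1}{2\rho N}\left[-2(1 - \rho)N + 2q_2 - 2 + (p + 1) + q_1 + 2\right]$.

To make this zero, we require that


(27) $\rho = \dfrac{N - q_2 - \tfrac{1}{2}(p + q_1 + 1)}{N}$.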


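Then

(28) $\Pr\left\{-2\frac{k}{N}\log\lambda \le z\right\}$

$= \Pr\{-k\log U_{p,q_1,N-q} \le z\}$

$= \Pr\{\chi^2_{pq_1} \le z\}$

$+ \dfrac{\gamma_2}{k^2}\left(\Pr\{\chi^2_{pq_1+4} \le z\} - \Pr\{\chi^2_{pq_1} \le z\}\right)$

$+ \dfrac{1}{k^4}\left[\gamma_4\left(\Pr\{\chi^2_{pq_1+8} \le z\} - \Pr\{\chi^2_{pq_1} \le z\}\right) - \gamma_2^2\left(\Pr\{\chi^2_{pq_1+4} \le z\} - \Pr\{\chi^2_{pq_1} \le z\}\right)\right] + R_5$,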


where


(29) $k = \rho N = N - q_2 - \tfrac{1}{2}(p + q_1 + 1) = n - \tfrac{1}{2}(p - q_1 + 1)$,


(30) $\gamma_2 = \dfrac{p q_1 (p^2 + q_1^2 - 5)}{48}$,


(31) $\gamma_4 = \dfrac{\gamma_2^2}{2} + \dfrac{p q_1}{1920}\left[3p^4 + 3q_1^4 + 10p^2 q_1^2 - 50(p^2 + q_1^2) + 159\right]$.


Since $\lambda = U_{p,q_1,n}^{N/2}$, where $n = N - q$, (28) gives $\Pr\{-k\log U_{p,q_1,n} \le z\}$.


Theorem 8.5.2. The cdf of $-k\log U_{p,q_1,n}$ is given by (28) with $k = n - \tfrac{1}{2}(p - q_1 + 1)$, and $\gamma_2$ and $\gamma_4$ given by (30) and (31), respectively. The remainder term is $O(N^{-6})$.


The coefficient $k = n - \tfrac{1}{2}(p - q_1 + 1)$ is known as the Bartlett correction. If the first term of (28) is used, the error is of the order $N^{-2}$; if the second, $N^{-4}$; and if the third,† $N^{-6}$. The second term is always negative and is numerically maximum for $z = \sqrt{(pq_1 + 2)(pq_1)}$ ($= pq_1 + 1$, approximately). For $p \ge 3$, $q_1 \ge 3$, we have $\gamma_2/k^2 \le [(p^2 + q_1^2)/k]^2/96$, and the contribution of the second term lies between $-0.005[(p^2 + q_1^2)/k]^2$ and 0. For $p \ge 3$, $q_1 \ge 3$, we have $\gamma_4 \le \gamma_2^2$, and the contribution of the third term is numerically less than $(\gamma_2/k^2)^2$. A rough rule that may be followed is that use of the first term is accurate to three decimal places if $p^2 + q_1^2 \le k/3$.

As an example of the calculation, consider the case of $p = 3$, $q_1 = 6$, $N - q_2 = 24$, and $z = 26.0$ (the 10% significance point of $\chi^2_{18}$). In this case $\gamma_2/k^2 = 0.048$ and the second term is $-0.007$; $\gamma_4/k^4 = 0.0015$ and the third term is $-0.0001$. Thus the probability of $-19\log U_{3,6,18} \le 26.0$ is 0.893 to three decimal places.
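The terms of (28) can be evaluated numerically with a short routine. The following is a minimal sketch, not a reference implementation: the function names `chi2_cdf` and `bartlett_box_terms` are hypothetical, and the $\chi^2$ cdf is computed from the standard power series for the regularized incomplete gamma function, which is an implementation choice made here so that only the Python standard library is needed.

```python
import math

def chi2_cdf(z, df):
    """Pr{chi-square with df degrees of freedom <= z}.

    Uses the series P(a, x) = x^a e^{-x} / Gamma(a) * sum_n x^n / (a (a+1) ... (a+n))
    with a = df/2 and x = z/2 (an implementation choice, not from the text).
    """
    if z <= 0:
        return 0.0
    a, x = df / 2.0, z / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-16 * abs(total):
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def bartlett_box_terms(z, p, q1, n):
    """Return k, gamma_2, gamma_4 of (29)-(31) and the three leading terms of (28)."""
    f = p * q1
    k = n - 0.5 * (p - q1 + 1)
    g2 = p * q1 * (p**2 + q1**2 - 5) / 48.0
    g4 = g2**2 / 2.0 + p * q1 / 1920.0 * (
        3 * p**4 + 3 * q1**4 + 10 * p**2 * q1**2 - 50 * (p**2 + q1**2) + 159)
    c0, c4, c8 = chi2_cdf(z, f), chi2_cdf(z, f + 4), chi2_cdf(z, f + 8)
    first = c0
    second = g2 / k**2 * (c4 - c0)
    third = (g4 * (c8 - c0) - g2**2 * (c4 - c0)) / k**4
    return k, g2, g4, (first, second, third)

# Usage with the parameter values of the example in the text (n = N - q = 18).
k, g2, g4, terms = bartlett_box_terms(26.0, p=3, q1=6, n=18)
print(k, g2, g4, terms, sum(terms))
```

The three terms are returned separately so that the size of each correction can be inspected before they are summed.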

Since


(32) $-\left[n - \tfrac{1}{2}(p - m + 1)\right]\log u_{p,m,n}(\alpha) = C_{p,m,n-p+1}(\alpha)\, \chi^2_{pm}(\alpha)$,

the proportional error in approximating the left-hand side by $\chi^2_{pm}(\alpha)$ is $C_{p,m,n-p+1} - 1$. The proportional error increases slowly with $p$ and $m$.


8.5.3. A Normal Approximation 


Mudholkar and Trivedi (1980), (1981) developed a normal approximation to the distribution of $-\log U_{p,m,n}$, which is asymptotic as $p$ and/or $m \to \infty$. It is related to the Wilson-Hilferty normal approximation for the $\chi^2$-distribution.


†Box has shown that the term of order $N^{-5}$ is 0 and gives the coefficients to be used in the term of order $N^{-6}$.




First, we give the background of the approximation. Suppose $\{Y_k\}$ is a sequence of nonnegative random variables such that $(Y_k - \mu_k)/\sigma_k \xrightarrow{d} N(0, 1)$ as $k \to \infty$, where $\mathcal{E}Y_k = \mu_k$ and $\mathcal{V}(Y_k) = \sigma_k^2$. Suppose also that $\mu_k \to \infty$ and $\sigma_k^2/\mu_k$ is bounded as $k \to \infty$. Let $Z_k = (Y_k/\mu_k)^h$. Then


(33) $\dfrac{Z_k - 1}{h(\sigma_k/\mu_k)} = \dfrac{\mu_k(Z_k - 1)}{h\sigma_k} \xrightarrow{d} N(0, 1)$


by Theorem 4.2.3. The approach to normality may be accelerated by choosing $h$ to make the distribution of $Z_k$ nearly symmetric as measured by its third cumulant. The normal distribution is to be used as an approximation and is justified by its accuracy in practice. However, it will be convenient to develop the ideas in terms of limits, although rigor is not necessary.

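By a Taylor expansion we express the $h$th moment of $Y_k/\mu_k$ as


(34) $\mathcal{E}Z_k = \mathcal{E}\left(\dfrac{Y_k}{\mu_k}\right)^h = 1 + \dfrac{h(h - 1)}{2}\dfrac{\sigma_k^2}{\mu_k^2} + \dfrac{h(h - 1)(h - 2)}{24}\, \dfrac{4\phi_k + 3(h - 3)(\sigma_k^2/\mu_k)^2}{\mu_k^2} + O(\mu_k^{-3})$,


where $\phi_k = \mathcal{E}(Y_k - \mu_k)^3/\mu_k$, assumed bounded. The $r$th moment of $Z_k$ is expressed by replacement of $h$ by $rh$ in (34). The central moments of $Z_k$ are


(35) $\mathcal{E}(Z_k - \mathcal{E}Z_k)^2 = h^2\dfrac{\sigma_k^2}{\mu_k^2} + \dfrac{h^2(h - 1)}{2}\, \dfrac{2\phi_k + (3h - 5)(\sigma_k^2/\mu_k)^2}{\mu_k^2} + O(\mu_k^{-3})$,

(36) $\mathcal{E}(Z_k - \mathcal{E}Z_k)^3 = h^3\, \dfrac{\phi_k + 3(h - 1)(\sigma_k^2/\mu_k)^2}{\mu_k^2} + O(\mu_k^{-3})$.


To make the third moment approximately 0 we take $h$ to be


(37) $h_0 = 1 - \dfrac{\mu_k\, \mathcal{E}(Y_k - \mu_k)^3}{3\sigma_k^4}$.


Then $Z_k = (Y_k/\mu_k)^{h_0}$ is treated as normally distributed with mean and variance given by (34) and (35), respectively, with $h = h_0$.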




Now we consider $-\log U_{p,m,n} = -\sum_{i=1}^{p}\log V_i$, where $V_1, \ldots, V_p$ are independent and $V_i$ has the density $\beta(x; (n + 1 - i)/2, m/2)$, $i = 1, \ldots, p$. As $n \to \infty$ and $m \to \infty$, $-\log V_i$ tends to normality. If $V$ has the density $\beta(x; a/2, b/2)$, the moment generating function of $-\log V$ is


(38) $\mathcal{E}e^{-t\log V} = \dfrac{\Gamma[(a + b)/2]\, \Gamma(a/2 - t)}{\Gamma(a/2)\, \Gamma[(a + b)/2 - t]}$.


Its logarithm is the cumulant generating function. Differentiation of the last yields as the $r$th cumulant of $-\log V$


(39) $C_r = (-1)^r\left[\psi^{(r-1)}\left(\dfrac{a}{2}\right) - \psi^{(r-1)}\left(\dfrac{a + b}{2}\right)\right]$, $\quad r = 1, 2, \ldots$,


where $\psi(w) = d\log\Gamma(w)/dw$. [See Abramowitz and Stegun (1972), p. 258, for example.] From $\Gamma(w + 1) = w\Gamma(w)$ we obtain the recursion relation $\psi(w + 1) = \psi(w) + 1/w$. This yields for $s = 0$ and $l$ an integer


(40) $\psi^{(s)}(w + l) - \psi^{(s)}(w) = (-1)^s s! \sum_{j=0}^{l-1} \dfrac{1}{(w + j)^{s+1}}$.


The validity of (40) for $s = 1, 2, \ldots$ is verified by differentiation. [The expression for $\psi'(Z)$ in the first line of page 223 of Mudholkar and Trivedi (1981) is incorrect.] Thus for $b = 2l$


(41) $C_r = (r - 1)!\, 2^r \sum_{j=0}^{l-1} \dfrac{1}{(a + 2j)^r}$.


From these results we obtain as the $r$th cumulant of $-\log U_{p,2l,n}$


(42) $\kappa_r(-\log U_{p,2l,n}) = 2^r (r - 1)! \sum_{i=1}^{p}\sum_{j=0}^{l-1} \dfrac{1}{(n - i + 1 + 2j)^r}$.


As $l \to \infty$ the series diverges for $r = 1$ and converges for $r = 2, 3$, and hence $\kappa_r/\kappa_1 \to 0$, $r = 2, 3$. The same is true as $p \to \infty$ (if $n/p$ approaches a positive constant).

Given $n$, $p$, and $l$, the first three cumulants are calculated from (42). Then $h_0$ is determined from (37), and $(-\log U_{p,2l,n})^{h_0}$ is treated as approximately normally distributed with mean and variance calculated from (34) and (35) for $h = h_0$.
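The first two steps of this recipe, the cumulants (42) and the exponent (37), can be sketched as follows. This is an illustrative sketch only; the function names are hypothetical, and it covers the case $m = 2l$ even. The third central moment in (37) is taken to be the third cumulant $\kappa_3$, and $\sigma^4 = \kappa_2^2$.

```python
import math

def neg_log_u_cumulant(r, p, l, n):
    """r-th cumulant of -log U_{p,2l,n} from (42); valid for m = 2l even."""
    s = sum(1.0 / (n - i + 1 + 2 * j) ** r
            for i in range(1, p + 1)
            for j in range(l))
    return 2 ** r * math.factorial(r - 1) * s

def h0_exponent(p, l, n):
    """Exponent h_0 of (37): 1 - mu * kappa_3 / (3 * kappa_2^2)."""
    k1, k2, k3 = (neg_log_u_cumulant(r, p, l, n) for r in (1, 2, 3))
    return 1.0 - k1 * k3 / (3.0 * k2 ** 2)

# Usage: first three cumulants and h_0 for p = 3, m = 2l = 6, n = 20.
print([neg_log_u_cumulant(r, 3, 3, 20) for r in (1, 2, 3)], h0_exponent(3, 3, 20))
```

For a $\chi^2$ variable with $v$ degrees of freedom the same formula for $h_0$ gives $1 - v \cdot 8v/(3 \cdot 4v^2) = 1/3$, the Wilson-Hilferty cube root.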

Mudholkar and Trivedi (1980) calculated the error of approximation for significance levels of 0.01 and 0.05 for $n$ from 4 to 66, $p = 3, 7$, and $q = 2, 6, 10$. The maximum error is less than 0.0007; in most cases the error is considerably less. The error for the $\chi^2$-approximation is much larger, especially for small values of $n$.

In case of $m$ odd the $r$th cumulant can be approximated by


(43) $2^r (r - 1)! \sum_{i=1}^{p}\left[\sum_{j=0}^{\frac{1}{2}(m-3)} \dfrac{1}{(n - i + 1 + 2j)^r} + \dfrac{1}{2}\, \dfrac{1}{(n - i + m)^r}\right]$.


Davis (1933, 1935) gave tables of $\psi(w)$ and its derivatives.


8.5.4. An F-Approximation


Rao (1951) has used the expansion of Section 8.5.2 to develop an expansion of the distribution of another function of $U_{p,m,n}$ in terms of beta distributions. The constants can be adjusted so that the term after the leading one is of order $m^{-4}$. A good approximation is to consider


(44) $\dfrac{1 - U^{1/s}}{U^{1/s}} \cdot \dfrac{ks - r}{pm}$

as $F$ with $pm$ and $ks - r$ degrees of freedom, where

(45) $s = \sqrt{\dfrac{p^2 m^2 - 4}{p^2 + m^2 - 5}}$, $\quad r = \dfrac{pm}{2} - 1$,


and $k$ is $n - \tfrac{1}{2}(p - m + 1)$. For $p = 1$ or 2 or $m = 1$ or 2 the $F$-distribution is exactly as given in Section 8.4. If $ks - r$ is not an integer, interpolation between two integer values can be used. For smaller values of $m$ this approximation is more accurate than the $\chi^2$-approximation.
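The statistic (44) and its degrees of freedom are simple to compute. The sketch below is illustrative; the function name `rao_f_statistic` is hypothetical. One point is an assumption not stated in the text: when $p^2 + m^2 - 5 = 0$ (for example $p = 1$, $m = 2$) the ratio in (45) is $0/0$, and the sketch then sets $s = 1$, which is a common convention and reproduces the exact $F$ result for $p = 1$.

```python
import math

def rao_f_statistic(u, p, m, n):
    """Return (F, df1, df2) for the approximation (44)-(45).

    u is the observed value of U_{p,m,n}, with 0 < u <= 1.
    """
    k = n - 0.5 * (p - m + 1)
    denom = p**2 + m**2 - 5
    # Assumption: take s = 1 when (45) is undefined or not positive.
    s = math.sqrt((p**2 * m**2 - 4) / denom) if denom > 0 else 1.0
    r = p * m / 2.0 - 1.0
    df1, df2 = p * m, k * s - r
    root = u ** (1.0 / s)
    return (1.0 - root) / root * df2 / df1, df1, df2

# Usage: p = 2, m = 2, n = 10 gives s = 2, so U^(1/s) is the square root of U.
print(rao_f_statistic(0.25, 2, 2, 10))
```

The returned statistic would then be referred to an $F$ table or an $F$ cdf routine with the returned degrees of freedom.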


8.6. OTHER CRITERIA FOR TESTING THE LINEAR HYPOTHESIS
8.6.1. Functions of Roots 


Thus far the only test of the linear hypothesis we have considered is the 
likelihood ratio test. In this section we consider other test procedures. 

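Let $\hat\Sigma_\Omega$, $\hat\beta_{1\Omega}$, and $\hat\beta_{2\omega}$ be the estimates of the parameters in $N(\beta z_\alpha, \Sigma)$, based on a sample of $N$ observations. These are a sufficient set of statistics, and we shall base test procedures on them. As was shown in Section 8.3, if the hypothesis is $\beta_1 = \beta_1^*$, one can reformulate the hypothesis as $\beta_1 = 0$ (by replacing $x_\alpha$ by $x_\alpha - \beta_1^* z_\alpha^{(1)}$). Moreover,


(1) $\beta z_\alpha = \beta_1 z_\alpha^{(1)} + \beta_2 z_\alpha^{(2)}$

$= \beta_1\left(z_\alpha^{(1)} - A_{12}A_{22}^{-1} z_\alpha^{(2)}\right) + \left(\beta_2 + \beta_1 A_{12}A_{22}^{-1}\right) z_\alpha^{(2)}$

$= \beta_1 z_\alpha^{*(1)} + \beta_2^* z_\alpha^{(2)}$,


where $\sum_\alpha z_\alpha^{*(1)} z_\alpha^{(2)\prime} = 0$ and $\sum_\alpha z_\alpha^{*(1)} z_\alpha^{*(1)\prime} = A_{11\cdot 2}$. Then $\hat\beta_1 = \hat\beta_{1\Omega}$ and $\hat\beta_2^* = \hat\beta_{2\omega}$.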

We shall use the principle of invariance to reduce the set of tests to be considered. First, if we make the transformation $X_\alpha^* = X_\alpha + \Gamma z_\alpha^{(2)}$, we leave the null hypothesis invariant, since $\mathcal{E}X_\alpha^* = \beta_1 z_\alpha^{*(1)} + (\beta_2^* + \Gamma) z_\alpha^{(2)}$ and $\beta_2^* + \Gamma$ is unspecified. The only invariants of the sufficient statistics are $\hat\Sigma$ and $\hat\beta_1$ (since for each $\hat\beta_2^*$, there is a $\Gamma$ that transforms it to 0, that is, $-\hat\beta_2^*$).

Second, the null hypothesis is invariant under the transformation $z_\alpha^{**(1)} = C z_\alpha^{*(1)}$ ($C$ nonsingular); the transformation carries $\beta_1$ to $\beta_1 C^{-1}$. Under this transformation $\hat\Sigma$ and $\hat\beta_1 A_{11\cdot 2}\hat\beta_1'$ are invariant; we consider $A_{11\cdot 2}$ as information relevant to inference. However, these are the only invariants. For consider a function of $\hat\beta_1$ and $A_{11\cdot 2}$, say $f(\hat\beta_1, A_{11\cdot 2})$. Then there is a $C^*$ that carries this into $f(\hat\beta_1 C^{*-1}, I)$, and a further orthogonal transformation carries this into $f(T, I)$, where $t_{iv} = 0$, $i < v$, $t_{ii} \ge 0$. (If each row of $T$ is considered a vector in $q_1$-space, the rotation of coordinate axes can be done so the first vector is along the first coordinate axis, the second vector is in the plane determined by the first two coordinate axes, and so forth.) But $T$ is a function of $TT' = \hat\beta_1 A_{11\cdot 2}\hat\beta_1'$; that is, the elements of $T$ are uniquely determined by this equation and the preceding restrictions. Thus our tests will depend on $\hat\Sigma$ and $\hat\beta_1 A_{11\cdot 2}\hat\beta_1'$. Let $N\hat\Sigma = G$ and $\hat\beta_1 A_{11\cdot 2}\hat\beta_1' = H$.

Third, the null hypothesis is invariant when $x_\alpha$ is replaced by $Kx_\alpha$, for $\Sigma$ and $\beta_2^*$ are unspecified. This transforms $G$ to $KGK'$ and $H$ to $KHK'$. The only invariants of $G$ and $H$ under such transformations are the roots of


(2) $|H - lG| = 0$.


It is clear the roots are invariant, for


(3) $0 = |KHK' - lKGK'|$
$= |K(H - lG)K'|$
$= |K| \cdot |H - lG| \cdot |K'|$.


On the other hand, these are the only invariants, for given $G$ and $H$ there is a $K$ such that $KGK' = I$ and


(4) $KHK' = L = \begin{pmatrix} l_1 & 0 & \cdots & 0 \\ 0 & l_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & l_p \end{pmatrix}$,


where $l_1 \ge \cdots \ge l_p$ are the roots of (2). (See Theorem A.2.2 of the Appendix.)


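Theorem 8.6.1. Let $x_\alpha$ be an observation from $N(\beta_1 z_\alpha^{*(1)} + \beta_2^* z_\alpha^{(2)}, \Sigma)$, where $\sum_\alpha z_\alpha^{*(1)} z_\alpha^{(2)\prime} = 0$ and $\sum_\alpha z_\alpha^{*(1)} z_\alpha^{*(1)\prime} = A_{11\cdot 2}$. The only functions of the sufficient statistics and $A_{11\cdot 2}$ invariant under the transformations $x_\alpha^* = x_\alpha + \Gamma z_\alpha^{(2)}$, $z_\alpha^{**(1)} = C z_\alpha^{*(1)}$, and $x_\alpha^* = Kx_\alpha$ are the roots of (2), where $G = N\hat\Sigma$ and $H = \hat\beta_1 A_{11\cdot 2}\hat\beta_1'$.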


The likelihood ratio criterion is a function of


(5) $U = \dfrac{|G|}{|G + H|} = \dfrac{|KGK'|}{|KGK' + KHK'|} = \dfrac{|I|}{|I + L|} = \prod_{i=1}^{p} (1 + l_i)^{-1}$,


which is clearly invariant under the transformations. 

Intuitively it would appear that good tests should reject the null hypothesis when the roots in some sense are large, for if $\beta_1$ is very different from 0, then $\hat\beta_1$ will tend to be large and so will $H$. Some other criteria that have been suggested are (a) $\sum l_i$, (b) $\sum l_i/(1 + l_i)$, (c) $\max l_i$, and (d) $\min l_i$. In each case we reject the null hypothesis if the criterion exceeds some specified number.
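Because every criterion above is a function of the roots of (2), computing the roots once is enough. The following is a minimal sketch for the case $p = 2$ only, where $|H - lG| = 0$ is a quadratic in $l$; the function names are hypothetical, and a general $p$ would need an eigenvalue routine that is not shown here.

```python
import math

def roots_2x2(h, g):
    """Roots l1 >= l2 of |H - l G| = 0 for symmetric 2x2 H and positive definite G."""
    a = g[0][0] * g[1][1] - g[0][1] ** 2                      # |G|
    b = h[0][0] * g[1][1] + h[1][1] * g[0][0] - 2 * h[0][1] * g[0][1]
    c = h[0][0] * h[1][1] - h[0][1] ** 2                      # |H|
    d = math.sqrt(max(b * b - 4 * a * c, 0.0))                # roots are real
    return (b + d) / (2 * a), (b - d) / (2 * a)

def criteria(roots):
    """The likelihood ratio criterion (5) and the criteria (a)-(d)."""
    return {
        "U": math.prod(1.0 / (1.0 + l) for l in roots),
        "sum_l": sum(roots),
        "sum_l_over_1_plus_l": sum(l / (1.0 + l) for l in roots),
        "max_l": max(roots),
        "min_l": min(roots),
    }

# Usage with illustrative matrices.
H = [[3.0, 0.0], [0.0, 1.0]]
G = [[1.0, 0.0], [0.0, 1.0]]
print(criteria(roots_2x2(H, G)))
```

The quadratic comes from expanding the determinant: $|H - lG| = |G|\,l^2 - (h_{11}g_{22} + h_{22}g_{11} - 2h_{12}g_{12})\,l + |H|$.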


8.6.2. The Lawley—Hotelling Trace Criterion 


Let $K$ be the matrix such that $KGK' = I$ [$G = K^{-1}(K')^{-1}$, or $G^{-1} = K'K$] and so (4) holds. Then the sum of the roots can be written


(6) $\sum_{i=1}^{p} l_i = \operatorname{tr} L = \operatorname{tr} KHK' = \operatorname{tr} HK'K = \operatorname{tr} HG^{-1}$.

This criterion was suggested by Lawley (1938), Bartlett (1939), and Hotelling (1947), (1951). The test procedure is to reject the hypothesis if (6) is greater than a constant depending on $p$, $m$, and $n$.




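The general distribution† of $\operatorname{tr} HG^{-1}$ cannot be characterized as easily as that of $U_{p,m,n}$. In the case of $p = 2$, Hotelling (1951) obtained an explicit expression for the distribution of $\operatorname{tr} HG^{-1} = l_1 + l_2$. A slightly different form of this distribution is obtained from the density of the two roots $l_1$ and $l_2$ in Chapter 13. It is


(7) $\Pr\{\operatorname{tr} HG^{-1} \le w\} = I_{w/(2+w)}(m - 1, n - 1) - \dfrac{\sqrt{\pi}\, \Gamma\left[\tfrac{1}{2}(m + n - 1)\right]}{\Gamma\left(\tfrac{1}{2}m\right)\Gamma\left(\tfrac{1}{2}n\right)}\, (1 + w)^{-\frac{1}{2}(n-1)}\, I_{w^2/(2+w)^2}\left[\tfrac{1}{2}(m - 1), \tfrac{1}{2}(n - 1)\right]$,

where $I_x(a, b)$ is the incomplete beta function, that is, the integral of $\beta(y; a, b)$ from 0 to $x$.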

Constantine (1966) expressed the density of $\operatorname{tr} HG^{-1}$ as an infinite series in generalized Laguerre polynomials and as an infinite series in zonal polynomials; these series, however, converge only for $\operatorname{tr} HG^{-1} < 1$. Davis (1968) showed that the analytic continuation of these series satisfies a system of linear homogeneous differential equations of order $p$. Davis (1970a, 1970b) used a solution to compute tables as given in Appendix B.

Under the null hypothesis, $G$ is distributed as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$ ($n = N - q$) and $H$ is distributed as $\sum_{v=1}^{m} Y_v Y_v'$, where the $Z_\alpha$ and $Y_v$ are independent, each with distribution $N(0, \Sigma)$. Since the roots are invariant under the previously specified linear transformation, we can choose $K$ so that $K\Sigma K' = I$ and let $G^* = KGK'$ [$= \sum (KZ_\alpha)(KZ_\alpha)'$] and $H^* = KHK'$. This is equivalent to assuming at the outset that $\Sigma = I$.

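Now


(8) $\operatorname{plim}_{n\to\infty} \dfrac{1}{n}G = \operatorname{plim}_{n\to\infty} \dfrac{1}{n}\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha' = I$.


This result follows applying the (weak) law of large numbers to each element of $(1/n)G$,


(9) $\operatorname{plim}_{n\to\infty} \dfrac{1}{n}\sum_{\alpha=1}^{n} Z_{i\alpha}Z_{j\alpha} = \mathcal{E}Z_{i\alpha}Z_{j\alpha} = \delta_{ij}$.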


Theorem 8.6.2. Let $f(H)$ be a function whose discontinuities form a set of probability zero when $H$ is distributed as $\sum_{v=1}^{m} Y_v Y_v'$ with the $Y_v$ independent, each with distribution $N(0, I)$. Then the limiting distribution of $f(NHG^{-1})$ is the distribution of $f(H)$.


†Lawley (1938) purported to derive the exact distribution, but the result is in error.


Proof. This is a straightforward application of a general theorem [for example, Theorem 2 of Chernoff (1956)] to the effect that if the cdf of $X_n$ converges to that of $X$ (at every continuity point of the latter) and if $g(x)$ is a function whose discontinuities form a set of probability 0 according to the distribution of $X$, then the cdf of $g(X_n)$ converges to that of $g(X)$. In our case $X_n$ consists of the components of $H$ and $G$, and $X$ consists of the components of $H$ and $I$. ∎


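Corollary 8.6.1. The limiting distribution of $N \operatorname{tr} HG^{-1}$ or $n \operatorname{tr} HG^{-1}$ is the $\chi^2$-distribution with $pq_1$ degrees of freedom.


This follows from Theorem 8.6.2, because


(10) $\operatorname{tr} H = \sum_{i=1}^{p} h_{ii} = \sum_{i=1}^{p}\sum_{v=1}^{q_1} y_{iv}^2$.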


Ito (1956), (1960) developed asymptotic formulas, and Fujikoshi (1973) extended them. Let $w_{p,m,n}(\alpha)$ be the $\alpha$ significance point of $\operatorname{tr} HG^{-1}$; that is,

(11) $\Pr\{\operatorname{tr} HG^{-1} > w_{p,m,n}(\alpha)\} = \alpha$,


and let $\chi_k^2(\alpha)$ be the $\alpha$-significance point of the $\chi^2$-distribution with $k$ degrees of freedom. Then


(12) $n w_{p,m,n}(\alpha) = \chi^2_{pm}(\alpha) + \dfrac{1}{2n}\left[\dfrac{p + m + 1}{pm + 2}\chi^4_{pm}(\alpha) + (p - m + 1)\chi^2_{pm}(\alpha)\right] + O(n^{-2})$.


Ito also gives the term of order $n^{-2}$. See also Muirhead (1970). Davis (1970a), (1970b) evaluated the accuracy of the approximation (12). Ito also found


(13) $\Pr\{n \operatorname{tr} HG^{-1} \le z\} = G_{pm}(z) - \dfrac{1}{2n}\left[\dfrac{p + m + 1}{pm + 2}z^2 + (p - m + 1)z\right]g_{pm}(z) + O(n^{-2})$,


where $G_k(z) = \Pr\{\chi_k^2 \le z\}$ and $g_k(z) = (d/dz)G_k(z)$. Pillai (1956) suggested another approximation to $n w_{p,m,n}(\alpha)$, and Pillai and Samson (1959) gave moments of $\operatorname{tr} HG^{-1}$. Pillai and Young (1971) and Krishnaiah and Chang (1972) evaluated the Laplace transform of $\operatorname{tr} HG^{-1}$ and showed how to invert the transform. Khatri and Pillai (1966) suggest an approximate distribution based on moments. Pillai and Young (1971) suggest approximate distributions based on the first three moments.

Tables of the significance points are given by Grubbs (1954) for $p = 2$ and by Davis (1970a) for $p = 3$ and 4, Davis (1970b) for $p = 5$, and Davis (1980) for $p = 6(1)10$; approximate significance points have been given by Pillai (1960). Davis's tables are reproduced in Table B.2.


8.6.3. The Bartlett-Nanda-Pillai Trace Criterion


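Another criterion, proposed by Bartlett (1939), Nanda (1950), and Pillai (1955), is


(14) $V = \sum_{i=1}^{p} \dfrac{l_i}{1 + l_i} = \operatorname{tr} L(I + L)^{-1}$

$= \operatorname{tr} KHK'(KGK' + KHK')^{-1}$
$= \operatorname{tr} HK'[K(G + H)K']^{-1}K$
$= \operatorname{tr} H(G + H)^{-1}$,


where as before $K$ is such that $KGK' = I$ and (4) holds. In terms of the roots $f_i = l_i/(1 + l_i)$, $i = 1, \ldots, p$, of


(15) $|H - f(H + G)| = 0$,


the criterion is $\sum_{i=1}^{p} f_i$. In principle, the cdf, density, and moments under the null hypothesis can be found from the density of the roots (Sec. 13.2.3),


(16) $C \prod_{i=1}^{p} f_i^{\frac{1}{2}(m-p-1)} \prod_{i=1}^{p} (1 - f_i)^{\frac{1}{2}(n-p-1)} \prod_{i<j} (f_i - f_j)$,

where


(17) $C = \dfrac{\pi^{\frac{1}{2}p^2}\, \Gamma_p\left[\tfrac{1}{2}(m + n)\right]}{\Gamma_p\left(\tfrac{1}{2}n\right)\Gamma_p\left(\tfrac{1}{2}m\right)\Gamma_p\left(\tfrac{1}{2}p\right)}$


for $1 > f_1 > \cdots > f_p > 0$, and 0 otherwise. If $m - p$ and $n - p$ are odd, the density is a polynomial in $f_1, \ldots, f_p$. Then the density and cdf of the sum of the roots are polynomials.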

Many authors have written about the moments, Laplace transforms, densities, and cdfs, using various approaches. Nanda (1950) derived the distribution for $p = 2, 3, 4$ and $m = p + 1$. Pillai (1954), (1956), (1960) and Pillai and Mijares (1959) calculated the first four moments of $V$ and proposed approximating the distribution by a beta distribution based on the first four moments. Pillai and Jayachandran (1970) show how to evaluate the moment
generating function as a weighted sum of determinants whose elements are 
incomplete gamma functions; they derive exact densities for some special 
cases and use them for a table of significance points. Krishnaiah and Chang 
(1972) express the distributions as linear combinations of inverse Laplace 
transforms of the products of certain double integrals and further develop 
this technique for finding the distribution. Davis (1972b) showed that the 
distribution satisfies a differential equation and showed the nature of the 
solution. Khatri and Pillai (1968) obtained the (nonnull) distributions in 
series forms. The characteristic function (under the null hypothesis) was 
given by James (1964). Pillai and Jayachandran (1967) found the nonnull 
distribution for p=2 and computed power functions. For an extensive 
bibliography see Krishnaiah (1978). 

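We now turn to the asymptotic theory. It follows from Theorem 8.6.2 that $nV$ or $NV$ has a limiting $\chi^2$-distribution with $pm$ degrees of freedom.

Let $v_{p,m,n}(\alpha)$ be defined by


(18) $\Pr\{\operatorname{tr} H(H + G)^{-1} \ge v_{p,m,n}(\alpha)\} = \alpha$.


Then Davis (1970a), (1970b), Fujikoshi (1973), and Rothenberg (1977) have shown that


(19) $n v_{p,m,n}(\alpha) = \chi^2_{pm}(\alpha) + \dfrac{1}{2n}\left[-\dfrac{p + m + 1}{pm + 2}\chi^4_{pm}(\alpha) + (p - m + 1)\chi^2_{pm}(\alpha)\right] + O(n^{-2})$.

Since we can write (for the likelihood ratio test)


(20) $-n\log u_{p,m,n}(\alpha) = \chi^2_{pm}(\alpha) + \dfrac{1}{2n}(p - m + 1)\chi^2_{pm}(\alpha) + O(n^{-2})$,

we have the comparison


(21) $n w_{p,m,n}(\alpha) = -n\log u_{p,m,n}(\alpha) + \dfrac{1}{2n}\, \dfrac{p + m + 1}{pm + 2}\chi^4_{pm}(\alpha) + O(n^{-2})$,


(22) $n v_{p,m,n}(\alpha) = -n\log u_{p,m,n}(\alpha) - \dfrac{1}{2n}\, \dfrac{p + m + 1}{pm + 2}\chi^4_{pm}(\alpha) + O(n^{-2})$.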
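The approximations (12), (19), and (20) differ only in the sign of one correction term, which a short computation makes visible. The sketch below is illustrative: the function name is hypothetical, the $O(n^{-2})$ terms are dropped, and the $\chi^2$ significance point is supplied by the caller (for instance from a table) rather than computed.

```python
def first_order_points(chi2_point, p, m, n):
    """First-order approximations to n*w, -n*log u, and n*v from (12), (20), (19).

    chi2_point is the alpha significance point of chi-square with p*m
    degrees of freedom; chi^4 in the text is its square.
    """
    x = chi2_point
    a = (p + m + 1) / (p * m + 2) * x * x / (2.0 * n)   # term with chi^4
    b = (p - m + 1) * x / (2.0 * n)                      # term with chi^2
    return {"n_w": x + a + b, "minus_n_log_u": x + b, "n_v": x - a + b}

# Usage: p = 2, m = 3, n = 50, with 12.592 as the 5% point of chi-square(6).
print(first_order_points(12.592, 2, 3, 50))
```

To this order the likelihood ratio point lies midway between the two trace points, in agreement with (21) and (22).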


An asymptotic expansion [Muirhead (1970), Fujikoshi (1973)] is

(23) $\Pr\{nV \le z\} = G_{pm}(z) + \dfrac{pm}{4n}\left[(m - p - 1)G_{pm}(z) + 2(p + 1)G_{pm+2}(z) - (p + m + 1)G_{pm+4}(z)\right] + O(n^{-2})$.


Higher-order terms are given by Muirhead and Fujikoshi.


Tables. Pillai (1960) tabulated 1% and 5% significance points of $V$ for $p = 2(1)8$ based on fitting Pearson curves (i.e., beta distributions with adjusted ranges) to the first four moments. Mijares (1964) extended the tables to $p = 50$. Table B.3 of some significance points of $(n + m)V/m = \operatorname{tr}\,(1/m)H\{[1/(n + m)](G + H)\}^{-1}$ is from Concise Statistical Tables, and was computed on the same basis as Pillai's. Schuurman, Krishnaiah, and Chattopadhyay (1975) gave exact significance points of $V$ for $p = 2(1)5$; a more extensive table is in their technical report (ARL 73-0008). A comparison of some values with those of Concise Statistical Tables (Appendix B) shows a maximum difference of 3 in the third decimal place.


8.6.4. The Roy Maximum Root Criterion 


Any characteristic root of $HG^{-1}$ can be used as a test criterion. Roy (1953) proposed $l_1$, the maximum characteristic root of $HG^{-1}$, on the basis of his union-intersection principle. The test procedure is to reject the null hypothesis if $l_1$ is greater than a certain number, or equivalently, if $f_1 = l_1/(1 + l_1) = R$ is greater than a number $r_{p,m,n}(\alpha)$ which satisfies


(24) $\Pr\{R \ge r_{p,m,n}(\alpha)\} = \alpha$.


The density of the roots $f_1, \ldots, f_p$ for $p \le m$ under the null hypothesis is given in (16). The cdf of $R = f_1$, $\Pr\{f_1 \le f^*\}$, can be obtained from the joint density by integration over the range $0 \le f_p \le \cdots \le f_1 \le f^*$. If $m - p$ and $n - p$ are both odd, the density of $f_1, \ldots, f_p$ is a polynomial; then the cdf of $f_1$ is a polynomial in $f^*$ and the density of $f_1$ is a polynomial. The only difficulty in carrying out the integration is keeping track of the different terms.

Roy [(1945), (1957), Appendix 9] developed a method of integration that results in a cdf that is a linear combination of products of univariate beta densities and beta cdfs. The cdf of $f_1$ for $p = 2$ is


(25) $\Pr\{f_1 \le f\} = I_f(m - 1, n - 1) - \dfrac{\sqrt{\pi}\, \Gamma\left[\tfrac{1}{2}(m + n - 1)\right]}{\Gamma\left(\tfrac{1}{2}m\right)\Gamma\left(\tfrac{1}{2}n\right)}\, f^{\frac{1}{2}(m-1)}(1 - f)^{\frac{1}{2}(n-1)}\, I_f\left[\tfrac{1}{2}(m - 1), \tfrac{1}{2}(n - 1)\right]$.



This is derived in Section 13.5. Roy (1957), Chapter 8, gives the cdfs for 
p=3 and 4 also. 

By Theorem 8.6.2 the limiting distribution of the largest characteristic root of $nHG^{-1}$, $NHG^{-1}$, $nH(H + G)^{-1}$, or $NH(H + G)^{-1}$ is the distribution of the largest characteristic root of $H$ having the distribution $W(I, m)$. The densities of the roots of $H$ are given in Section 13.3. In principle, the marginal density of the largest root can be obtained from the joint density by integration, but in actual fact the integration is more difficult than that for the density of the roots of $HG^{-1}$ or $H(H + G)^{-1}$.

The literature on this subject is too extensive to summarize here. Nanda (1948) obtained the distribution for $p = 2, 3, 4$, and 5. Pillai (1954), (1956), (1965), (1967) treated the distribution under the null hypothesis. Other results were obtained by Sugiyama and Fukutomi (1966) and Sugiyama (1967). Pillai (1967) derived an appropriate distribution as a linear combination of incomplete beta functions. Davis (1972a) showed that the density of a single ordered root satisfies a differential equation and (1972b) derived a recurrence relation for it. Hayakawa (1967), Khatri and Pillai (1968), Pillai and Sugiyama (1969), and Khatri (1972) treated the noncentral case. See Krishnaiah (1978) for more references.


Tables. Tables of the percentage points have been calculated by Nanda (1951) and Foster and Rees (1957) for $p = 2$, Foster (1957) for $p = 3$, Foster (1958) for $p = 4$, and Pillai (1960) for $p = 2(1)6$ on the basis of an approximation. [See also Pillai (1956), (1960), (1964), (1965), (1967).] Heck (1960) presented charts of the significance points for $p = 2(1)6$. Table B.4 of significance points of $nl_1/m$ is from Concise Statistical Tables, based on the approximation by Pillai (1967).


8.6.5. Comparison of Powers 


The four tests that have been given most consideration are those based on Wilks's $U$, the Lawley-Hotelling $W$, the Bartlett-Nanda-Pillai $V$, and Roy's $R$. To guide in the choice of one of these four, we would like to compare power functions. The first three have been compared by Rothenberg on the basis of the asymptotic expansions of their distributions in the nonnull case. Let $\nu_1^N, \ldots, \nu_p^N$ be the roots of


(26) $\left|(\beta_1 - \beta_1^*)A^N_{11\cdot 2}(\beta_1 - \beta_1^*)' - \nu\Sigma\right| = 0$.


The distribution of


(27) $\operatorname{tr}\,(\hat\beta_{1\Omega} - \beta_1^*)A^N_{11\cdot 2}(\hat\beta_{1\Omega} - \beta_1^*)'\Sigma^{-1}$


is the noncentral $\chi^2$-distribution with $pm$ degrees of freedom and noncentrality parameter $\sum_{i=1}^{p}\nu_i^N$. As $N \to \infty$, the quantity $(1/n)G$ or $(1/N)G$ approaches $\Sigma$ with probability one. If we let $N \to \infty$ and $A^N_{11\cdot 2}$ is unbounded, the noncentrality parameter grows indefinitely and the power approaches 1. It is more informative to consider a sequence of alternatives such that the powers of the different tests are different. Suppose $\beta_1 = \beta_1^N$ is a sequence of matrices such that as $N \to \infty$, $(\beta_1^N - \beta_1^*)A^N_{11\cdot 2}(\beta_1^N - \beta_1^*)'$ approaches a limit and hence $\nu_1^N, \ldots, \nu_p^N$ approach some limiting values $\nu_1, \ldots, \nu_p$, respectively. Then the limiting distribution of $N\operatorname{tr} HG^{-1}$, $n\operatorname{tr} HG^{-1}$, $N\operatorname{tr} H(H + G)^{-1}$, and $n\operatorname{tr} H(H + G)^{-1}$ is the noncentral $\chi^2$-distribution with $pm$ degrees of freedom and noncentrality parameter $\sum_{i=1}^{p}\nu_i$. Similarly for $-N\log U$ and $-n\log U$.
Rothenberg (1977) has shown under the above conditions that


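(28) $\Pr\{U \le u_{p,m,n}(\alpha)\} = 1 - G_{pm}\left[\chi^2_{pm}(\alpha) \,\Big|\, \sum_{i=1}^{p}\nu_i\right]$

$+ \dfrac{1}{2n}\left\{(p + m + 1)\sum_{i=1}^{p}\nu_i\, g_{pm+4}\left[\chi^2_{pm}(\alpha)\right] + \sum_{i=1}^{p}\nu_i^2\, g_{pm+6}\left[\chi^2_{pm}(\alpha)\right]\right\} + o\left(\dfrac{1}{n}\right)$,


(29) $\Pr\{\operatorname{tr} HG^{-1} \ge w_{p,m,n}(\alpha)\} = 1 - G_{pm}\left[\chi^2_{pm}(\alpha) \,\Big|\, \sum_{i=1}^{p}\nu_i\right]$

$+ \dfrac{1}{2n}\left\{(p + m + 1)\sum_{i=1}^{p}\nu_i\, g_{pm+4}\left[\chi^2_{pm}(\alpha)\right] + \sum_{i=1}^{p}\nu_i^2\, g_{pm+6}\left[\chi^2_{pm}(\alpha)\right] + \left[\sum_{i=1}^{p}\nu_i^2 - \dfrac{p + m + 1}{pm + 2}\left(\sum_{i=1}^{p}\nu_i\right)^2\right] g_{pm+8}\left[\chi^2_{pm}(\alpha)\right]\right\} + o\left(\dfrac{1}{n}\right)$,


(30) $\Pr\{\operatorname{tr} H(H + G)^{-1} \ge v_{p,m,n}(\alpha)\} = 1 - G_{pm}\left[\chi^2_{pm}(\alpha) \,\Big|\, \sum_{i=1}^{p}\nu_i\right]$

$+ \dfrac{1}{2n}\left\{(p + m + 1)\sum_{i=1}^{p}\nu_i\, g_{pm+4}\left[\chi^2_{pm}(\alpha)\right] + \sum_{i=1}^{p}\nu_i^2\, g_{pm+6}\left[\chi^2_{pm}(\alpha)\right] + \left[\dfrac{p + m + 1}{pm + 2}\left(\sum_{i=1}^{p}\nu_i\right)^2 - \sum_{i=1}^{p}\nu_i^2\right] g_{pm+8}\left[\chi^2_{pm}(\alpha)\right]\right\} + o\left(\dfrac{1}{n}\right)$,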


where $G_f(x \mid y)$ is the noncentral $\chi^2$-distribution with $f$ degrees of freedom and noncentrality parameter $y$, and $g_f(x)$ is the (central) $\chi^2$-density with $f$ degrees of freedom. The leading terms are the noncentral $\chi^2$-distribution; the power functions of the three tests agree to this order. The power functions of the two trace tests differ from that of the likelihood ratio test by $\pm g_{pm+8}[\chi^2_{pm}(\alpha)]/(2n)$ times


(31) $\sum_{i=1}^{p}\nu_i^2 - \dfrac{p + m + 1}{pm + 2}\left(\sum_{i=1}^{p}\nu_i\right)^2 = \sum_{i=1}^{p}(\nu_i - \bar\nu)^2 - \dfrac{p(p - 1)(p + 2)}{pm + 2}\bar\nu^2$,


where $\bar\nu = \sum_{i=1}^{p}\nu_i/p$. This is positive if


(32) $\dfrac{\sigma_\nu}{\bar\nu} > \sqrt{\dfrac{(p - 1)(p + 2)}{pm + 2}}$,


where $\sigma_\nu^2 = \sum_{i=1}^{p}(\nu_i - \bar\nu)^2/p$ is the (population) variance of $\nu_1, \ldots, \nu_p$; the left-hand side of (32) is the coefficient of variation. If the $\nu_i$'s are relatively variable in the sense that (32) holds, the power of the Lawley-Hotelling trace test is greater than that of the likelihood ratio test, which in turn is greater than that of the Bartlett-Nanda-Pillai trace test (to order $1/n$); if the inequality (32) is reversed, the ordering of power is reversed.

The differences between the powers decrease as $n$ increases for fixed $\nu_1, \ldots, \nu_p$. (However, this comparison is not very meaningful, because increasing $n$ decreases $\beta_1^N - \beta_1^*$ and increases $Z'Z$.)

A number of numerical comparisons have been made. Schatzoff (1966b) and Olson (1974) have used Monte Carlo methods; Mikhail (1965), Pillai and Jayachandran (1967), and Lee (1971a) have used asymptotic expansions of distributions. All of these results agree with Rothenberg's. Among these three procedures, the Bartlett-Nanda-Pillai trace test is to be preferred if the roots are roughly equal in the alternative, and the Lawley-Hotelling trace is more powerful when the roots are substantially unequal. Wilks's likelihood ratio test seems to come in second best; in a sense it is maximin.

As noted in Section 8.6.4, the Roy largest root has a limiting distribution which is not a $\chi^2$-distribution under the null hypothesis and is not a noncentral $\chi^2$-distribution under a sequence of alternative hypotheses. Hence the comparison of Rothenberg cannot be extended to this case. In fact, the distributions in the nonnull case are difficult to evaluate. However, the Monte Carlo results of Schatzoff (1966b) and Olson (1974) are clear-cut. The maximum root test has greatest power if the alternative is one-dimensional, that is, if $\nu_2 = \cdots = \nu_p = 0$. On the other hand, if the alternative is not one-dimensional, then the maximum root test is inferior.

These test procedures tend to be robust. Under the null hypothesis the limiting distribution of $\hat\beta_1 - \beta_1^*$ suitably normalized is normal with mean 0 and covariances the same as if $X$ were normal, as long as its distribution satisfies some condition such as bounded fourth-order moments. Then $\hat\Sigma = (1/N)G$ converges with probability one. The limiting distribution of each criterion suitably normalized is the same as if $X$ were normal. Olson (1974) studied the robustness under departures from covariance homogeneity as well as departures from normality. His conclusion was that the two trace tests and the likelihood ratio test were rather robust, and the maximum root test least robust. See also Pillai and Hsu (1979).

Berndt and Savin (1977) have noted that


(33) $\operatorname{tr} H(H + G)^{-1} \le \log U^{-1} \le \operatorname{tr} HG^{-1}$.


(See Problem 8.19.) If the $\chi^2$ significance point is used, then a larger criterion may lead to rejection while a smaller one may not.
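One way to see (33), offered as a sketch rather than as the intended solution of Problem 8.19, is to write each of the three criteria in terms of the roots $l_1, \ldots, l_p \ge 0$ of $|H - lG| = 0$, using (5), (6), and (14), and to apply an elementary inequality to each root:

```latex
\frac{l}{1+l} \;\le\; \log(1+l) \;\le\; l \qquad (l \ge 0),
\qquad\text{so}\qquad
\underbrace{\sum_{i=1}^{p} \frac{l_i}{1+l_i}}_{\operatorname{tr} H(H+G)^{-1}}
\;\le\;
\underbrace{\sum_{i=1}^{p} \log(1+l_i)}_{\log U^{-1}}
\;\le\;
\underbrace{\sum_{i=1}^{p} l_i}_{\operatorname{tr} HG^{-1}} .
```

Equality holds throughout only when every root is 0.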


8.7. TESTS OF HYPOTHESES ABOUT MATRICES OF REGRESSION 
COEFFICIENTS AND CONFIDENCE REGIONS 


8.7.1. Testing Hypotheses 


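Suppose we are given a set of vector observations $x_1, \ldots, x_N$ with accompanying fixed vectors $z_1, \ldots, z_N$, where $x_\alpha$ is an observation from $N(\beta z_\alpha, \Sigma)$. We let $\beta = (\beta_1\ \beta_2)$ and $z_\alpha' = (z_\alpha^{(1)\prime}, z_\alpha^{(2)\prime})$, where $\beta_1$ and $z_\alpha^{(1)\prime}$ have $q_1$ ($= q - q_2$) columns. The null hypothesis is


(1) $H: \beta_1 = \beta_1^*$,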




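where $\beta_1^*$ is a specified matrix. Suppose the desired significance level is $\alpha$. A test procedure is to compute


(2) $U = \dfrac{|N\hat\Sigma_\Omega|}{|N\hat\Sigma_\omega|}$

and compare this number with $u_{p,q_1,n}(\alpha)$, the $\alpha$ significance point of the $U_{p,q_1,n}$-distribution. For $p = 2, \ldots, 10$ and even $m$, Table 1 in Appendix B can be used. For $m = 2, \ldots, 10$ and even $p$ the same table can be used with $m$ replaced by $p$ and $p$ replaced by $m$. ($M$ as given in the table remains unchanged.) For $p$ and $m$ both odd, interpolation between even values of either $p$ or $m$ will give sufficient accuracy for most purposes. For reasonably large $n$, the asymptotic theory can be used. An equivalent procedure is to calculate $\Pr\{U_{p,m,n} \le U\}$; if this is less than $\alpha$, the null hypothesis is rejected.
Alternatively one can use the Lawley-Hotelling trace criterion


(3) $W = \operatorname{tr}\,(N\hat\Sigma_\omega - N\hat\Sigma_\Omega)(N\hat\Sigma_\Omega)^{-1}$

$= \operatorname{tr}\,(\hat\beta_{1\Omega} - \beta_1^*)A_{11\cdot 2}(\hat\beta_{1\Omega} - \beta_1^*)'(N\hat\Sigma_\Omega)^{-1}$,

the Pillai trace criterion

(4) $V = \operatorname{tr}\,(N\hat\Sigma_\omega - N\hat\Sigma_\Omega)(N\hat\Sigma_\omega)^{-1}$

$= \operatorname{tr}\,(\hat\beta_{1\Omega} - \beta_1^*)A_{11\cdot 2}(\hat\beta_{1\Omega} - \beta_1^*)'(N\hat\Sigma_\omega)^{-1}$,

or the Roy maximum root criterion $R$, where $R$ is the maximum root of

(5) $\left|N\hat\Sigma_\omega - N\hat\Sigma_\Omega - rN\hat\Sigma_\omega\right| = \left|(\hat\beta_{1\Omega} - \beta_1^*)A_{11\cdot 2}(\hat\beta_{1\Omega} - \beta_1^*)' - rN\hat\Sigma_\omega\right| = 0$.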


These criteria can be referred to the appropriate tables in Appendix B. 

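We outline an approach to computing the criterion. If we let $y_\alpha = x_\alpha - \beta_1^* z_\alpha^{(1)}$, then $y_\alpha$ can be considered as an observation from $N(\Delta z_\alpha, \Sigma)$, where $\Delta = (\Delta_1\ \Delta_2) = (\beta_1 - \beta_1^*\ \ \beta_2)$. Then the null hypothesis is $H: \Delta_1 = 0$, and


(6) $\sum_\alpha y_\alpha y_\alpha' = \sum_\alpha x_\alpha x_\alpha' - \beta_1^* C_1' - C_1\beta_1^{*\prime} + \beta_1^* A_{11}\beta_1^{*\prime}$,
(7) $\sum_\alpha y_\alpha z_\alpha' = C - \beta_1^*(A_{11}\ A_{12})$.


Thus the problem of testing the hypothesis $\beta_1 = \beta_1^*$ is equivalent to testing the hypothesis $\Delta_1 = 0$, where $\mathcal{E}y_\alpha = \Delta z_\alpha$. Hence let us suppose the problem is testing the hypothesis $\beta_1 = 0$. Then $N\hat\Sigma_\omega = \sum_\alpha x_\alpha x_\alpha' - \hat\beta_{2\omega}A_{22}\hat\beta_{2\omega}'$ and $N\hat\Sigma_\Omega = \sum_\alpha x_\alpha x_\alpha' - \hat\beta_\Omega A\hat\beta_\Omega'$. We have discussed in Section 8.2.2 the computation of $\hat\beta_\Omega A\hat\beta_\Omega'$ and hence $N\hat\Sigma_\Omega$. Then $\hat\beta_{2\omega}A_{22}\hat\beta_{2\omega}'$ can be computed in a similar manner. If the method is laid out as


(8) $\begin{pmatrix} A_{22} & A_{21} \\ A_{12} & A_{11} \end{pmatrix}\begin{pmatrix} \hat\beta_{2\Omega}' \\ \hat\beta_{1\Omega}' \end{pmatrix} = \begin{pmatrix} C_2' \\ C_1' \end{pmatrix}$,


the first $q_2$ rows and columns of $A^*$ and of $A^{**}$ are the same as the result of applying the forward solution to the left-hand side of


(9) $A_{22}\hat\beta_{2\omega}' = C_2'$,


and the first $q_2$ rows of $C^*$ and $C^{**}$ are the same as the result of applying the forward solution to the right-hand side of (9). Thus $\hat\beta_{2\omega}A_{22}\hat\beta_{2\omega}' = C_2^{*\prime}C_2^{**}$, where $C^{*\prime} = (C_2^{*\prime}\ C_1^{*\prime})$ and $C^{**\prime} = (C_2^{**\prime}\ C_1^{**\prime})$.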

The method implies a method for computing a determinant. In Section A.5 of the Appendix it is shown that the result of the forward solution is $FA = A^*$. Thus $|F| \cdot |A| = |A^*|$. Since the determinant of a triangular matrix is the product of its diagonal elements, $|F| = 1$ and $|A| = |A^*| = \prod_{i=1}^{q} a_{ii}^*$. This result holds for any positive definite matrix in place of $A$ (with a suitable modification of $F$) and hence can be used to compute $|N\hat\Sigma_\Omega|$ and $|N\hat\Sigma_\omega|$.
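The determinant computation just described can be sketched in a few lines. This is an illustrative sketch, not the layout of Section A.5: the function name is hypothetical, and the forward solution is taken here to be Gaussian elimination without row interchanges or row scaling, so that the multiplier matrix $F$ is unit lower triangular and the pivots are the diagonal elements $a_{ii}^*$.

```python
def det_by_forward_solution(a):
    """Determinant of a positive definite matrix as the product of the pivots.

    a is a list of rows; a working copy is reduced to upper triangular form
    by subtracting multiples of earlier rows from later rows.
    """
    a = [[float(x) for x in row] for row in a]
    n = len(a)
    det = 1.0
    for i in range(n):
        pivot = a[i][i]
        if pivot <= 0.0:
            raise ValueError("matrix is not positive definite")
        det *= pivot
        for r in range(i + 1, n):
            factor = a[r][i] / pivot
            for c in range(i, n):
                a[r][c] -= factor * a[i][c]
    return det

# Usage: U of (2) as a ratio of two determinants, with illustrative matrices.
n_sigma_Omega = [[4.0, 2.0], [2.0, 3.0]]
n_sigma_omega = [[6.0, 2.0], [2.0, 4.0]]
print(det_by_forward_solution(n_sigma_Omega) / det_by_forward_solution(n_sigma_omega))
```

No pivoting is needed because every leading principal minor of a positive definite matrix is positive.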


8.7.2. Confidence Regions Based on U 


We have considered tests of hypotheses $B_1 = B_1^*$, where $B_1^*$ is specified. In the usual way we can deduce from the family of tests a confidence region for $B_1$. From the theory given before, we know that the probability is $1 - \alpha$ of drawing a sample so that

(10) $\dfrac{|N\hat{\Sigma}_\Omega|}{|N\hat{\Sigma}_\Omega + (\hat{B}_{1\Omega} - B_1)A_{11\cdot 2}(\hat{B}_{1\Omega} - B_1)'|} \ge u_{p,q_1,n}(\alpha).$

Thus if we make the confidence-region statement that $B_1$ satisfies

(11) $\dfrac{|N\hat{\Sigma}_\Omega|}{|N\hat{\Sigma}_\Omega + (\hat{B}_{1\Omega} - \bar{B}_1)A_{11\cdot 2}(\hat{B}_{1\Omega} - \bar{B}_1)'|} \ge u_{p,q_1,n}(\alpha),$

where (11) is interpreted as an inequality on $B_1 = \bar{B}_1$, then the probability is $1 - \alpha$ of drawing a sample such that the statement is true.


Theorem 8.7.1. The region (11) in the $\bar{B}_1$-space is a confidence region for $B_1$ with confidence coefficient $1 - \alpha$.




Usually the set of $\bar{B}_1$ satisfying (11) is difficult to visualize. However, the
inequality can be used to determine whether trial matrices are included in 
the region. 
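Checking a trial matrix against (11) amounts to evaluating one ratio of determinants. The sketch below assumes that $G = N\hat{\Sigma}_\Omega$, $\hat{B}_{1\Omega}$, and $A_{11\cdot 2}$ have already been computed; all numerical inputs, including the value standing in for the tabulated point $u_{p,q_1,n}(\alpha)$, are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def det(a):
    """Determinant by forward elimination; assumes positive definite input."""
    m = [list(map(float, row)) for row in a]
    value = 1.0
    for j in range(len(m)):
        value *= m[j][j]
        for i in range(j + 1, len(m)):
            factor = m[i][j] / m[j][j]
            for k in range(j, len(m)):
                m[i][k] -= factor * m[j][k]
    return value


def u_ratio(g, b1_hat, b1_trial, a11_2):
    """Left-hand side of (11) with G = N * Sigma_hat_Omega."""
    diff = [[x - y for x, y in zip(r_hat, r_trial)]
            for r_hat, r_trial in zip(b1_hat, b1_trial)]
    h = matmul(matmul(diff, a11_2), transpose(diff))
    total = [[gij + hij for gij, hij in zip(g_row, h_row)]
             for g_row, h_row in zip(g, h)]
    return det(g) / det(total)


def in_region(g, b1_hat, b1_trial, a11_2, u_alpha):
    """True if the trial matrix satisfies inequality (11)."""
    return u_ratio(g, b1_hat, b1_trial, a11_2) >= u_alpha


# Hypothetical inputs with p = 2 and q_1 = 1 (not taken from the text).
G = [[10.0, 2.0], [2.0, 8.0]]
B1_HAT = [[1.0], [0.5]]
A11_2 = [[4.0]]
U_ALPHA = 0.8  # placeholder for the tabulated u_{p,q1,n}(alpha)

ratio_at_zero = u_ratio(G, B1_HAT, [[0.0], [0.0]], A11_2)
print(ratio_at_zero, in_region(G, B1_HAT, [[0.0], [0.0]], A11_2, U_ALPHA))
```

The ratio equals 1 at the trial matrix $\hat{B}_{1\Omega}$ itself and decreases as the trial matrix moves away from it.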


8.7.3. Simultaneous Confidence Intervals Based on the
Lawley-Hotelling Trace


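Each test procedure implies a set of confidence regions. The Lawley-Hotelling trace criterion can be used to develop simultaneous confidence intervals for linear combinations of elements of $B_1$. A confidence region with confidence coefficient $1 - \alpha$ is

(12) $\operatorname{tr}\,(\hat{B}_{1\Omega} - \bar{B}_1)A_{11\cdot 2}(\hat{B}_{1\Omega} - \bar{B}_1)'(N\hat{\Sigma}_\Omega)^{-1} \le w_{p,m,n}(\alpha).$

To derive the confidence bounds we generalize Lemma 5.3.2.

Lemma 8.7.1. For positive definite matrices A and G,

(13) $|\operatorname{tr}\Phi'Y| \le \sqrt{\operatorname{tr}A^{-1}\Phi'G\Phi}\,\sqrt{\operatorname{tr}AY'G^{-1}Y}.$

Proof. Let $b = \operatorname{tr}\Phi'Y / \operatorname{tr}A^{-1}\Phi'G\Phi$. Then

(14) $0 \le \operatorname{tr}A(Y - bG\Phi A^{-1})'G^{-1}(Y - bG\Phi A^{-1})$

$\qquad = \operatorname{tr}AY'G^{-1}Y - b\operatorname{tr}\Phi'Y - b\operatorname{tr}Y'\Phi + b^2\operatorname{tr}\Phi'G\Phi A^{-1}$

$\qquad = \operatorname{tr}AY'G^{-1}Y - \dfrac{(\operatorname{tr}\Phi'Y)^2}{\operatorname{tr}A^{-1}\Phi'G\Phi},$

which yields (13). ∎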
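Inequality (13) is easy to check numerically. The following sketch, restricted to 2 × 2 matrices so that the inverse has a closed form, draws random positive definite A and G and random Φ and Y; it also evaluates the case $Y = G\Phi A^{-1}$, in which the two sides coincide. The helper names and the random construction are assumptions made for illustration only.

```python
import math
import random


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def inv2(m):
    """Inverse of a nonsingular 2 x 2 matrix."""
    d = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def trace(m):
    return sum(m[i][i] for i in range(len(m)))


def random_matrix(rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(2)]


def random_pd(rng):
    """M M' + I is positive definite for any real M."""
    m = random_matrix(rng)
    s = matmul(m, transpose(m))
    return [[s[i][j] + (1.0 if i == j else 0.0) for j in range(2)]
            for i in range(2)]


def both_sides(phi, y, a, g):
    """Return the left-hand and right-hand sides of (13)."""
    lhs = abs(trace(matmul(transpose(phi), y)))
    t1 = trace(matmul(inv2(a), matmul(transpose(phi), matmul(g, phi))))
    t2 = trace(matmul(a, matmul(transpose(y), matmul(inv2(g), y))))
    return lhs, math.sqrt(t1 * t2)


rng = random.Random(0)
all_hold = True
for _ in range(500):
    a, g = random_pd(rng), random_pd(rng)
    lhs, rhs = both_sides(random_matrix(rng), random_matrix(rng), a, g)
    all_hold = all_hold and lhs <= rhs + 1e-12

# Equality case: Y = G Phi A^{-1}.
a, g, phi = random_pd(rng), random_pd(rng), random_matrix(rng)
y_eq = matmul(g, matmul(phi, inv2(a)))
lhs_eq, rhs_eq = both_sides(phi, y_eq, a, g)
print(all_hold, lhs_eq, rhs_eq)
```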
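Now (12) and (13) imply that

(15) $|\operatorname{tr}\Phi'\hat{B}_{1\Omega} - \operatorname{tr}\Phi'B_1| = |\operatorname{tr}\Phi'(\hat{B}_{1\Omega} - B_1)|$

$\qquad \le \sqrt{\operatorname{tr}A_{11\cdot 2}^{-1}\Phi'N\hat{\Sigma}_\Omega\Phi \cdot \operatorname{tr}A_{11\cdot 2}(\hat{B}_{1\Omega} - B_1)'(N\hat{\Sigma}_\Omega)^{-1}(\hat{B}_{1\Omega} - B_1)}$

holds for all p × m matrices Φ. We assert that

(16) $\operatorname{tr}\Phi'\hat{B}_{1\Omega} - \sqrt{N\operatorname{tr}A_{11\cdot 2}^{-1}\Phi'\hat{\Sigma}_\Omega\Phi}\,\sqrt{w_{p,m,n}(\alpha)} \le \operatorname{tr}\Phi'B_1$

$\qquad \le \operatorname{tr}\Phi'\hat{B}_{1\Omega} + \sqrt{N\operatorname{tr}A_{11\cdot 2}^{-1}\Phi'\hat{\Sigma}_\Omega\Phi}\,\sqrt{w_{p,m,n}(\alpha)}$

holds for all Φ with confidence $1 - \alpha$.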




The confidence region (12) can be explored by use of (16) for various Φ. If $\phi_{ik} = 1$ for some pair (I, K) and 0 for other elements, then (16) gives an interval for $\beta_{IK}$. If $\phi_{ik} = 1$ for a pair (I, K), $-1$ for (I, L), and 0 otherwise, the interval pertains to $\beta_{IK} - \beta_{IL}$, the difference of coefficients of two independent variables. If $\phi_{ik} = 1$ for a pair (I, K), $-1$ for (J, K), and 0 otherwise, one obtains an interval for $\beta_{IK} - \beta_{JK}$, the difference of coefficients for two dependent variables.


8.7.4. Simultaneous Confidence Intervals Based on the Roy Maximum 
Root Criterion 


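A confidence region with confidence $1 - \alpha$ based on the maximum root criterion is

(17) $\operatorname{ch}_1\,(\hat{B}_{1\Omega} - \bar{B}_1)A_{11\cdot 2}(\hat{B}_{1\Omega} - \bar{B}_1)'(N\hat{\Sigma}_\Omega)^{-1} \le r_{p,m,n}(\alpha),$

where $\operatorname{ch}_1(C)$ denotes the largest characteristic root of C. We can derive simultaneous confidence bounds from (17). From Lemma 5.3.2, we find for any vectors a and b

(18) $[a'(\hat{B}_{1\Omega} - B_1)b]^2 = \{[(\hat{B}_{1\Omega} - B_1)'a]'b\}^2$

$\qquad \le [(\hat{B}_{1\Omega} - B_1)'a]'A_{11\cdot 2}[(\hat{B}_{1\Omega} - B_1)'a] \cdot b'A_{11\cdot 2}^{-1}b$

$\qquad = \dfrac{a'(\hat{B}_{1\Omega} - B_1)A_{11\cdot 2}(\hat{B}_{1\Omega} - B_1)'a}{a'Ga} \cdot a'Ga \cdot b'A_{11\cdot 2}^{-1}b$

$\qquad \le \operatorname{ch}_1[(\hat{B}_{1\Omega} - B_1)A_{11\cdot 2}(\hat{B}_{1\Omega} - B_1)'G^{-1}] \cdot a'Ga \cdot b'A_{11\cdot 2}^{-1}b$

$\qquad \le r_{p,m,n}(\alpha) \cdot a'Ga \cdot b'A_{11\cdot 2}^{-1}b$

with probability $1 - \alpha$; the second inequality follows from Theorem A.2.4 of the Appendix. Then a set of confidence intervals on all linear combinations $a'B_1b$ holding with confidence $1 - \alpha$ is

(19) $a'\hat{B}_{1\Omega}b - \sqrt{r_{p,m,n}(\alpha) \cdot a'Ga \cdot b'A_{11\cdot 2}^{-1}b} \le a'B_1b \le a'\hat{B}_{1\Omega}b + \sqrt{r_{p,m,n}(\alpha) \cdot a'Ga \cdot b'A_{11\cdot 2}^{-1}b}.$

The linear combinations are $a'B_1b = \sum_{i=1}^{p}\sum_{h=1}^{m} a_i\beta_{ih}b_h$. If $a_1 = 1$, $a_i = 0$, $i \ne 1$, and $b_1 = 1$, $b_h = 0$, $h \ne 1$, the linear combination is simply $\beta_{11}$. If $a_1 = 1$, $a_i = 0$, $i \ne 1$, and $b_1 = 1$, $b_2 = -1$, $b_h = 0$, $h \ne 1, 2$, the linear combination is $\beta_{11} - \beta_{12}$.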




We can compare these intervals with (16) for $\Phi = ab'$, which is of rank 1. The term subtracted from and added to $\operatorname{tr}\Phi'\hat{B}_{1\Omega} = a'\hat{B}_{1\Omega}b$ is the square root of

(20) $w_{p,m,n}(\alpha) \cdot \operatorname{tr}A_{11\cdot 2}^{-1}ba'Gab' = w_{p,m,n}(\alpha) \cdot a'Ga \cdot b'A_{11\cdot 2}^{-1}b.$

This is greater than the term subtracted and added to $a'\hat{B}_{1\Omega}b$ in (19) because $w_{p,m,n}(\alpha)$, pertaining to the sum of the roots, is greater than $r_{p,m,n}(\alpha)$, relating to one root. The bounds (16) hold for all p × m matrices Φ, while (19) holds only for matrices $ab'$ of rank 1.
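For a rank-one $\Phi = ab'$ the two half-widths differ only through the constants $w_{p,m,n}(\alpha)$ and $r_{p,m,n}(\alpha)$, as the short sketch below illustrates. The matrices and the two constants are hypothetical placeholders; in practice the constants would be read from the appropriate tables.

```python
import math


def quadratic_form(x, m):
    """x' M x for a vector x and a square matrix M."""
    return sum(x[i] * m[i][j] * x[j]
               for i in range(len(x)) for j in range(len(x)))


def half_widths(a, b, g, a11_2_inv, w_alpha, r_alpha):
    """Half-widths for a'B_1 b from (16) with Phi = ab' and from (19).

    g stands for G = N * Sigma_hat_Omega and a11_2_inv for the inverse
    of A_{11.2}; w_alpha and r_alpha are the tabulated constants.
    """
    scale = quadratic_form(a, g) * quadratic_form(b, a11_2_inv)
    return math.sqrt(w_alpha * scale), math.sqrt(r_alpha * scale)


# Hypothetical values chosen only to show the comparison.
G = [[10.0, 2.0], [2.0, 8.0]]
A11_2_INV = [[0.5, 0.0], [0.0, 0.25]]
a_vec = [1.0, 0.0]   # picks out the first dependent variable
b_vec = [1.0, -1.0]  # contrast between two independent variables
hw_trace, hw_root = half_widths(a_vec, b_vec, G, A11_2_INV,
                                w_alpha=0.9, r_alpha=0.6)
print(hw_trace, hw_root)
```

The ratio of the two half-widths is $\sqrt{w_{p,m,n}(\alpha)/r_{p,m,n}(\alpha)}$ for every choice of a and b.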

Mudholkar (1966) gives a very general method of constructing simultane- 
ous confidence intervals based on symmetric gauge functions. Gabriel (1969) 
relates confidence bounds to simultaneous test procedures. Wijsman (1979)
showed that under certain conditions the confidence sets based on the 
maximum root are smallest. [See also Wijsman (1980).] 


8.8. TESTING EQUALITY OF MEANS OF SEVERAL NORMAL 
DISTRIBUTIONS WITH COMMON COVARIANCE MATRIX 


In univariate analysis it is well known that many hypotheses can be put in the 
form of hypotheses concerning regression coefficients. The same is true for 
the corresponding multivariate cases. As an example we consider testing the 
hypothesis that the means of, say, q normal distributions with a common 
covariance matrix are equal. 

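Let $y_\alpha^{(i)}$ be an observation from $N(\mu^{(i)}, \Sigma)$, $\alpha = 1, \ldots, N_i$, $i = 1, \ldots, q$. The null hypothesis is

(1) $H: \mu^{(1)} = \cdots = \mu^{(q)}.$

To put the problem in the form considered earlier in this chapter, let

(2) $X = (x_1\ x_2\ \cdots\ x_{N_1}\ x_{N_1+1}\ \cdots\ x_N) = (y_1^{(1)}\ y_2^{(1)}\ \cdots\ y_{N_1}^{(1)}\ y_1^{(2)}\ \cdots\ y_{N_q}^{(q)})$

with $N = N_1 + \cdots + N_q$. Let

(3) $Z = (z_1\ z_2\ \cdots\ z_{N_1}\ z_{N_1+1}\ \cdots\ z_N) = \begin{pmatrix} 1 & 1 & \cdots & 1 & 0 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 & 1 & \cdots & 1 & \cdots & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots & \vdots & & \vdots & & \vdots & & \vdots \\ 0 & 0 & \cdots & 0 & 0 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 1 & 1 & \cdots & 1 & \cdots & 1 & \cdots & 1 \end{pmatrix};$

that is, $z_{i\alpha} = 1$ if $N_1 + \cdots + N_{i-1} < \alpha \le N_1 + \cdots + N_i$ and $z_{i\alpha} = 0$ otherwise, for $i = 1, \ldots, q - 1$, and $z_{q\alpha} = 1$ (all α). Let $B = (B_1\ B_2)$, where

(4) $B_1 = (\mu^{(1)} - \mu^{(q)}, \ldots, \mu^{(q-1)} - \mu^{(q)}), \qquad B_2 = \mu^{(q)}.$

Then $x_\alpha$ is an observation from $N(Bz_\alpha, \Sigma)$, and the null hypothesis is $B_1 = 0$. Thus we can use the above theory for finding the criterion for testing the hypothesis.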
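We have

(5) $A = \sum_{\alpha=1}^{N} z_\alpha z_\alpha' = \begin{pmatrix} N_1 & 0 & \cdots & 0 & N_1 \\ 0 & N_2 & \cdots & 0 & N_2 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & N_{q-1} & N_{q-1} \\ N_1 & N_2 & \cdots & N_{q-1} & N \end{pmatrix},$

(6) $C = \sum_{\alpha=1}^{N} x_\alpha z_\alpha' = \Bigl(\sum_\alpha y_\alpha^{(1)}\ \ \sum_\alpha y_\alpha^{(2)}\ \cdots\ \sum_\alpha y_\alpha^{(q-1)}\ \ \sum_{i,\alpha} y_\alpha^{(i)}\Bigr).$

Here $A_{22} = N$ and $C_2 = \sum_{i,\alpha} y_\alpha^{(i)}$. Thus $\hat{B}_{2\omega} = \sum_{i,\alpha} y_\alpha^{(i)}(1/N) = \bar{y}$, say, and

(7) $N\hat{\Sigma}_\omega = \sum_\alpha x_\alpha x_\alpha' - \bar{y}N\bar{y}'$

$\qquad = \sum_{i,\alpha} y_\alpha^{(i)}y_\alpha^{(i)\prime} - N\bar{y}\bar{y}'$

$\qquad = \sum_{i,\alpha} (y_\alpha^{(i)} - \bar{y})(y_\alpha^{(i)} - \bar{y})'.$

For $\hat{\Sigma}_\Omega$, we use the formula $N\hat{\Sigma}_\Omega = \sum_\alpha x_\alpha x_\alpha' - \hat{B}_\Omega A\hat{B}_\Omega' = \sum_\alpha x_\alpha x_\alpha' - CA^{-1}C'$. Let

(8) $D = \begin{pmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \\ -1 & -1 & \cdots & -1 & 1 \end{pmatrix};$

then

(9) $D^{-1} = \begin{pmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \\ 1 & 1 & \cdots & 1 & 1 \end{pmatrix}.$

Thus

(10) $CA^{-1}C' = CD'D'^{-1}A^{-1}D^{-1}DC'$

$\qquad = CD'(DAD')^{-1}DC'$

$\qquad = \Bigl(\sum_\alpha y_\alpha^{(1)}\ \cdots\ \sum_\alpha y_\alpha^{(q)}\Bigr) \begin{pmatrix} N_1 & 0 & \cdots & 0 \\ 0 & N_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & N_q \end{pmatrix}^{-1} \begin{pmatrix} \sum_\alpha y_\alpha^{(1)\prime} \\ \vdots \\ \sum_\alpha y_\alpha^{(q)\prime} \end{pmatrix}$

$\qquad = \sum_i \dfrac{1}{N_i}\Bigl(\sum_\alpha y_\alpha^{(i)}\Bigr)\Bigl(\sum_\alpha y_\alpha^{(i)}\Bigr)'$

$\qquad = \sum_i N_i\bar{y}^{(i)}\bar{y}^{(i)\prime},$

where $\bar{y}^{(i)} = (1/N_i)\sum_\alpha y_\alpha^{(i)}$. Thus

(11) $N\hat{\Sigma}_\Omega = \sum_{i,\alpha} y_\alpha^{(i)}y_\alpha^{(i)\prime} - \sum_i N_i\bar{y}^{(i)}\bar{y}^{(i)\prime}$

$\qquad = \sum_{i,\alpha} (y_\alpha^{(i)} - \bar{y}^{(i)})(y_\alpha^{(i)} - \bar{y}^{(i)})'.$

It will be seen that $\hat{\Sigma}_\omega$ is the estimator of Σ when $\mu^{(1)} = \cdots = \mu^{(q)}$ and $\hat{\Sigma}_\Omega$ is the weighted average of the estimators of Σ based on the separate samples.

When the null hypothesis is true, $|N\hat{\Sigma}_\Omega| / |N\hat{\Sigma}_\omega|$ is distributed as $U_{p,q-1,n}$, where $n = N - q$. Therefore, the rejection region at the α significance level is

(12) $\dfrac{|N\hat{\Sigma}_\Omega|}{|N\hat{\Sigma}_\omega|} < u_{p,q-1,n}(\alpha).$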




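The left-hand side of (12) is (11) of Section 8.3, and

(13) $N\hat{\Sigma}_\omega - N\hat{\Sigma}_\Omega = \sum_{i,\alpha} y_\alpha^{(i)}y_\alpha^{(i)\prime} - N\bar{y}\bar{y}' - \Bigl(\sum_{i,\alpha} y_\alpha^{(i)}y_\alpha^{(i)\prime} - \sum_i N_i\bar{y}^{(i)}\bar{y}^{(i)\prime}\Bigr)$

$\qquad = \sum_i N_i\bar{y}^{(i)}\bar{y}^{(i)\prime} - N\bar{y}\bar{y}'$

$\qquad = \sum_i N_i(\bar{y}^{(i)} - \bar{y})(\bar{y}^{(i)} - \bar{y})' = H,$

as implied by (4) and (5) of Section 8.4. Here H has the distribution $W(\Sigma, q - 1)$. It will be seen that when p = 1, this test reduces to the usual F-test

(14) $\dfrac{\sum_i N_i(\bar{y}^{(i)} - \bar{y})^2}{\sum_{i,\alpha}(y_\alpha^{(i)} - \bar{y}^{(i)})^2} \cdot \dfrac{n}{q - 1} > F_{q-1,n}(\alpha).$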


We give an example of the analysis. The data are taken from Barnard's study of Egyptian skulls (1935). The 4 (= q) populations are Late Predynastic (i = 1), Sixth to Twelfth (i = 2), Twelfth to Thirteenth (i = 3), and Ptolemaic Dynasties (i = 4). The 4 (= p) measurements (i.e., components of $y_\alpha^{(i)}$) are maximum breadth, basialveolar length, nasal height, and basibregmatic height. The numbers of observations are $N_1 = 91$, $N_2 = 162$, $N_3 = 70$, $N_4 = 75$. The data are summarized as

(15) $(\bar{y}^{(1)}\ \bar{y}^{(2)}\ \bar{y}^{(3)}\ \bar{y}^{(4)}) = \begin{pmatrix} 133.582418 & 134.265432 & 134.371429 & 135.306667 \\ 98.307692 & 96.462963 & 95.857143 & 95.040000 \\ 50.835165 & 51.148148 & 50.100000 & 52.093333 \\ 133.000000 & 134.882716 & 133.642857 & 131.466667 \end{pmatrix},$

(16) $N\hat{\Sigma}_\Omega = \begin{pmatrix} 9661.997470 & 445.573301 & 1130.623900 & 2148.584210 \\ 445.573301 & 9073.115027 & 1239.211990 & 2255.812722 \\ 1130.623900 & 1239.211990 & 3938.320351 & 1271.054662 \\ 2148.584210 & 2255.812722 & 1271.054662 & 8741.508829 \end{pmatrix}.$

From these data we find

(17) $N\hat{\Sigma}_\omega = \begin{pmatrix} 9785.178098 & 214.197666 & 1217.929248 & 2019.820216 \\ 214.197666 & 9559.460890 & 1131.716372 & 2381.126040 \\ 1217.929248 & 1131.716372 & 4088.731856 & 1133.473898 \\ 2019.820216 & 2381.126040 & 1133.473898 & 9382.242720 \end{pmatrix}.$




We shall use the likelihood ratio test. The ratio of determinants is

(18) $U = \dfrac{|N\hat{\Sigma}_\Omega|}{|N\hat{\Sigma}_\omega|} = \dfrac{2.4269054 \times 10^{15}}{2.9544475 \times 10^{15}} = 0.8214344.$

Here N = 398, n = 394, p = 4, and q = 4. Thus k = 393. Since n is very large, we may assume $-k\log U_{4,3,394}$ is distributed as $\chi^2$ with 12 degrees of freedom (when the null hypothesis is true). Here $-k\log U = 77.30$. Since the 1% point of the $\chi^2_{12}$-distribution is 26.2, the hypothesis of $\mu^{(1)} = \mu^{(2)} = \mu^{(3)} = \mu^{(4)}$ is rejected.¹
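The arithmetic of this example can be retraced as follows. The sketch assumes the multiplier $k = n - \frac{1}{2}(p - q_1 + 1)$ with $q_1 = q - 1$, which reproduces the value k = 393 quoted above; only the leading digits of the two determinants are used, since their common power of ten cancels in the ratio. Variable names are illustrative.

```python
import math

# Leading digits of the determinants reported in (18); the common power
# of ten cancels in the ratio.
det_n_sigma_Omega = 2.4269054
det_n_sigma_omega = 2.9544475

N, p, q = 398, 4, 4
n = N - q                     # 394
q1 = q - 1                    # number of columns of B_1
k = n - (p - q1 + 1) / 2      # assumed form of the multiplier; gives 393

U = det_n_sigma_Omega / det_n_sigma_omega
statistic = -k * math.log(U)  # referred to chi-square with p * q1 d.f.
degrees_of_freedom = p * q1

print(round(U, 5), round(statistic, 2), degrees_of_freedom)
```

The statistic far exceeds the 1% point 26.2 of the $\chi^2_{12}$-distribution quoted in the text.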


8.9. MULTIVARIATE ANALYSIS OF VARIANCE 


The univariate analysis of variance has a direct generalization for vector 
variables leading to an analysis of vector sums of squares (i.e., sums such as 
$\sum_\alpha x_\alpha x_\alpha'$). In fact, in the preceding section this generalization was considered
for an analysis of variance problem involving a single classification. 

As another example consider a two-way layout. Suppose that we are 
interested in the question whether the column effects are zero. We shall 
review the analysis for a scalar variable and then show the analysis for a 
vector variable. Let $Y_{ij}$, $i = 1, \ldots, r$, $j = 1, \ldots, c$, be a set of rc random variables. We assume that

(1) $\mathcal{E}Y_{ij} = \mu + \lambda_i + \nu_j, \qquad i = 1, \ldots, r, \quad j = 1, \ldots, c,$

with the restrictions

(2) $\sum_{i=1}^{r} \lambda_i = \sum_{j=1}^{c} \nu_j = 0,$

that the variance of $Y_{ij}$ is $\sigma^2$, and that the $Y_{ij}$ are independently normally distributed. To test that column effects are zero is to test that

(3) $\nu_j = 0, \qquad j = 1, \ldots, c.$

¹The above computations were given by Bartlett (1947).

This problem can be treated as a problem of regression by the introduction of dummy fixed variates. Let

(4) $z_{00,ij} = 1,$

$\quad z_{k0,ij} = 1,\ k = i; \qquad = 0,\ k \ne i,$

$\quad z_{0k,ij} = 1,\ k = j; \qquad = 0,\ k \ne j.$

Then (1) can be written

(5) $\mathcal{E}Y_{ij} = \mu z_{00,ij} + \sum_{k=1}^{r} \lambda_k z_{k0,ij} + \sum_{k=1}^{c} \nu_k z_{0k,ij}.$

The hypothesis is that the coefficients of $z_{0k,ij}$ are zero. Since the matrix of fixed variates here,

(6) $\begin{pmatrix} z_{00,11} & \cdots & z_{00,rc} \\ z_{10,11} & \cdots & z_{10,rc} \\ \vdots & & \vdots \\ z_{r0,11} & \cdots & z_{r0,rc} \\ z_{01,11} & \cdots & z_{01,rc} \\ \vdots & & \vdots \\ z_{0c,11} & \cdots & z_{0c,rc} \end{pmatrix},$

is singular (for example, row 00 is the sum of rows 10, 20, ..., r0), one must elaborate the regression theory. When one does, one finds that the test criterion indicated by the regression theory is the usual F-test of analysis of variance.


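Let

(7) $\bar{Y}_{\cdot\cdot} = \dfrac{1}{rc}\sum_{i,j} Y_{ij}, \qquad \bar{Y}_{i\cdot} = \dfrac{1}{c}\sum_{j} Y_{ij}, \qquad \bar{Y}_{\cdot j} = \dfrac{1}{r}\sum_{i} Y_{ij},$

and let

(8) $a = \sum_{i,j} (Y_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot})^2 = \sum_{i,j} Y_{ij}^2 - c\sum_i \bar{Y}_{i\cdot}^2 - r\sum_j \bar{Y}_{\cdot j}^2 + rc\bar{Y}_{\cdot\cdot}^2,$

$\quad b = r\sum_j (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})^2 = r\sum_j \bar{Y}_{\cdot j}^2 - rc\bar{Y}_{\cdot\cdot}^2.$

Then the F-statistic is given by

(9) $F = \dfrac{b}{a} \cdot \dfrac{(c - 1)(r - 1)}{c - 1}.$

Under the null hypothesis, this has the F-distribution with $c - 1$ and $(r - 1)(c - 1)$ degrees of freedom. The likelihood ratio criterion for the hypothesis is the rc/2 power of

(10) $\dfrac{a}{a + b} = \dfrac{1}{1 + \{(c - 1)/[(r - 1)(c - 1)]\}F}.$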


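Now let us turn to the multivariate analysis of variance. We have a set of p-dimensional random vectors $Y_{ij}$, $i = 1, \ldots, r$, $j = 1, \ldots, c$, with expected values (1), where μ, the λ's, and the ν's are vectors, and with covariance matrix Σ, and they are independently normally distributed. Then the same algebra may be used to reduce this problem to the regression problem. We define $\bar{Y}_{\cdot\cdot}$, $\bar{Y}_{i\cdot}$, $\bar{Y}_{\cdot j}$ by (7) and

(11) $A = \sum_{i,j} (Y_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot})(Y_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot})'$

$\qquad = \sum_{i,j} Y_{ij}Y_{ij}' - c\sum_i \bar{Y}_{i\cdot}\bar{Y}_{i\cdot}' - r\sum_j \bar{Y}_{\cdot j}\bar{Y}_{\cdot j}' + rc\bar{Y}_{\cdot\cdot}\bar{Y}_{\cdot\cdot}',$

$\quad B = r\sum_j (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})(\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})'$

$\qquad = r\sum_j \bar{Y}_{\cdot j}\bar{Y}_{\cdot j}' - rc\bar{Y}_{\cdot\cdot}\bar{Y}_{\cdot\cdot}'.$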

Table 8.1 (rows: locations; columns: varieties)

A statistic analogous to (10) is

(12) $\dfrac{|A|}{|A + B|}.$

Under the null hypothesis, this has the distribution of U for p, $n = (r - 1)(c - 1)$ and $q_1 = c - 1$ given in Section 8.4. In order for A to be nonsingular (with probability 1), we must require $p \le (r - 1)(c - 1)$.
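The matrices A and B of (11) and the criterion (12) can be computed directly from a two-way array of vectors. The following sketch is one straightforward way to do so; the function names and the small 3 × 3 layout with p = 2 are hypothetical and serve only to show the computation.

```python
def two_way_sums_of_squares(y):
    """Return (A, B) of (11) for y[i][j], the p-vector in row i, column j."""
    r, c, p = len(y), len(y[0]), len(y[0][0])
    grand = [sum(y[i][j][k] for i in range(r) for j in range(c)) / (r * c)
             for k in range(p)]
    row_mean = [[sum(y[i][j][k] for j in range(c)) / c for k in range(p)]
                for i in range(r)]
    col_mean = [[sum(y[i][j][k] for i in range(r)) / r for k in range(p)]
                for j in range(c)]
    a = [[0.0] * p for _ in range(p)]
    b = [[0.0] * p for _ in range(p)]
    for i in range(r):
        for j in range(c):
            # Residual after removing row and column effects.
            e = [y[i][j][k] - row_mean[i][k] - col_mean[j][k] + grand[k]
                 for k in range(p)]
            for s in range(p):
                for t in range(p):
                    a[s][t] += e[s] * e[t]
    for j in range(c):
        d = [col_mean[j][k] - grand[k] for k in range(p)]
        for s in range(p):
            for t in range(p):
                b[s][t] += r * d[s] * d[t]
    return a, b


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# Hypothetical data: r = 3 rows, c = 3 columns, p = 2 components.
Y = [[(81, 80), (105, 82), (120, 77)],
     [(147, 100), (142, 116), (151, 112)],
     [(82, 103), (77, 105), (78, 117)]]
A, B = two_way_sums_of_squares(Y)
A_PLUS_B = [[A[s][t] + B[s][t] for t in range(2)] for s in range(2)]
U = det2(A) / det2(A_PLUS_B)
print(U)
```

With purely additive data (no interaction and no error) the residuals vanish and A is the zero matrix, which is why the condition $p \le (r - 1)(c - 1)$ and genuine random variation are needed for (12) to be defined.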

As an example we use data first published by Immer, Hayes, and Powers 
(1934), and later used by Fisher (1947a), by Yates and Cochran (1938), and by 
Tukey (1949). The first component of the observation vector is the barley 
yield in a given year; the second component is the same measurement made 
the following year. Column indices run over the varieties of barley, and row 


indices over the locations. The data are given in Table 8.1 [e.g., the pair 81, 81 in the upper left-hand corner indicates a yield of 81 in each year of variety M in location UF]. The numbers along the borders are sums.

We consider the square of (147, 100) to be

$\begin{pmatrix} 147 \\ 100 \end{pmatrix}(147\ \ 100) = \begin{pmatrix} 21{,}609 & 14{,}700 \\ 14{,}700 & 10{,}000 \end{pmatrix}.$


Then 

(3) Erin (sisae. es | 
oe rio MOR (Guwe emse 
oe EK, )8H)"= (TSeoras 1355727) 
oe (20r,)¢20¥,)' = (BES 240 1812.05 | 


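Then the error sum of squares is

(17) $A = \begin{pmatrix} 3279 & 802 \\ 802 & 4017 \end{pmatrix},$

the row sum of squares is

(18) $5\sum_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})' = \begin{pmatrix} 18{,}011 & 7{,}188 \\ 7{,}188 & 10{,}345 \end{pmatrix},$

and the column sum of squares is

(19) $B = \begin{pmatrix} 2788 & 2550 \\ 2550 & 2863 \end{pmatrix}.$

The test criterion is

(20) $\dfrac{|A|}{|A + B|} = \dfrac{\begin{vmatrix} 3279 & 802 \\ 802 & 4017 \end{vmatrix}}{\begin{vmatrix} 6067 & 3352 \\ 3352 & 6880 \end{vmatrix}} = 0.4107.$

This result is to be compared with the significance point for $U_{2,4,20}$. Using the result of Section 8.4, we see that

$\dfrac{1 - \sqrt{0.4107}}{\sqrt{0.4107}} \cdot \dfrac{19}{4} = 2.66$

is to be compared with the significance point of $F_{8,38}$. This is significant at the 5% level. Our data show that there are differences between varieties.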
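The numbers in (20) and the subsequent F-statistic can be checked from the matrices in (17) and (19). The sketch below assumes the p = 2 transformation of Section 8.4, under which $[(1 - \sqrt{U})/\sqrt{U}](n - 1)/m$ is referred to F with 2m and 2(n − 1) degrees of freedom, with n = 20 and m = 4 as in $U_{2,4,20}$; the variable names are illustrative.

```python
import math


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


A = [[3279, 802], [802, 4017]]     # error sum of squares, (17)
B = [[2788, 2550], [2550, 2863]]   # column sum of squares, (19)
A_PLUS_B = [[A[s][t] + B[s][t] for t in range(2)] for s in range(2)]

U = det2(A) / det2(A_PLUS_B)

n, m = 20, 4                       # as in U_{2,4,20}
root_u = math.sqrt(U)
f_statistic = (1.0 - root_u) / root_u * (n - 1) / m
f_degrees = (2 * m, 2 * (n - 1))   # assumed p = 2 transformation

print(round(U, 4), round(f_statistic, 2), f_degrees)
```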




Now let us see that each F-test in the univariate analysis of variance has 
analogous tests in the multivariate analysis of variance. In the linear hypothe- 
sis model for the univariate analysis of variance, one assumes that the 
random variables $Y_1, \ldots, Y_N$ have expected values that are linear combinations of unknown parameters

(21) $\mathcal{E}Y_\alpha = \sum_g \beta_g z_{g\alpha},$

where the β's are the parameters and the z's are the known coefficients. The variables $\{Y_\alpha\}$ are assumed to be normally and independently distributed with common variance $\sigma^2$. In this model there are a set of linear combinations, say $\sum_{\alpha=1}^{N} \gamma_{i\alpha}Y_\alpha$, where the γ's are known, such that

(22) $a = \sum_{i=1}^{n} \Bigl(\sum_{\alpha=1}^{N} \gamma_{i\alpha}Y_\alpha\Bigr)^2 = \sum_{\alpha,\beta=1}^{N} d_{\alpha\beta}Y_\alpha Y_\beta$

is distributed as $\sigma^2\chi^2$ with n degrees of freedom. There is another set of linear combinations, say $\sum_\alpha \phi_{g\alpha}Y_\alpha$, where the φ's are known, such that

(23) $b = \sum_{g=1}^{m} \Bigl(\sum_{\alpha=1}^{N} \phi_{g\alpha}Y_\alpha\Bigr)^2 = \sum_{\alpha,\beta=1}^{N} c_{\alpha\beta}Y_\alpha Y_\beta$

is distributed as $\sigma^2\chi^2$ with m degrees of freedom when the null hypothesis is true and as $\sigma^2$ times a noncentral $\chi^2$ when the null hypothesis is not true; and in either case b is distributed independently of a. Then

(24) $\dfrac{b/m}{a/n} = \dfrac{b}{a} \cdot \dfrac{n}{m}$

has the F-distribution with m and n degrees of freedom, respectively, when the null hypothesis is true. The null hypothesis is that certain β's are zero.

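In the multivariate analysis of variance, $Y_1, \ldots, Y_N$ are vector variables with p components. The expected value of $Y_\alpha$ is given by (21) where $\beta_g$ is a vector of p parameters. We assume that the $\{Y_\alpha\}$ are normally and independently distributed with common covariance matrix Σ. The linear combinations $\sum_\alpha \gamma_{i\alpha}Y_\alpha$ can be formed for the vectors. Then

(25) $A = \sum_{i=1}^{n} \Bigl(\sum_{\alpha=1}^{N} \gamma_{i\alpha}Y_\alpha\Bigr)\Bigl(\sum_{\alpha=1}^{N} \gamma_{i\alpha}Y_\alpha\Bigr)' = \sum_{\alpha,\beta=1}^{N} d_{\alpha\beta}Y_\alpha Y_\beta'$

has the distribution $W(\Sigma, n)$. When the null hypothesis is true,

(26) $B = \sum_{g=1}^{m} \Bigl(\sum_{\alpha=1}^{N} \phi_{g\alpha}Y_\alpha\Bigr)\Bigl(\sum_{\alpha=1}^{N} \phi_{g\alpha}Y_\alpha\Bigr)' = \sum_{\alpha,\beta=1}^{N} c_{\alpha\beta}Y_\alpha Y_\beta'$

has the distribution $W(\Sigma, m)$, and B is independent of A. Then

(27) $\dfrac{|A|}{|A + B|} = \dfrac{\bigl|\sum d_{\alpha\beta}Y_\alpha Y_\beta'\bigr|}{\bigl|\sum d_{\alpha\beta}Y_\alpha Y_\beta' + \sum c_{\alpha\beta}Y_\alpha Y_\beta'\bigr|}$

has the $U_{p,m,n}$-distribution.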

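The argument for the distribution of a and b involves showing that $\mathcal{E}\sum_\alpha \gamma_{i\alpha}Y_\alpha = 0$ and $\mathcal{E}\sum_\alpha \phi_{g\alpha}Y_\alpha = 0$ when certain β's are equal to zero as specified by the null hypothesis (as identities in the unspecified β's). Clearly this argument holds for the vector case as well. Secondly, one argues, in the univariate case, that there is an orthogonal matrix $\Psi = (\psi_{\alpha\beta})$ such that when the transformation $Y_\beta = \sum_\gamma \psi_{\beta\gamma}Z_\gamma$ is made

(28) $a = \sum_{\alpha,\beta,\gamma,\delta} d_{\alpha\beta}\psi_{\alpha\gamma}\psi_{\beta\delta}Z_\gamma Z_\delta = \sum_{\alpha=1}^{n} Z_\alpha^2,$

$\quad b = \sum_{\alpha,\beta,\gamma,\delta} c_{\alpha\beta}\psi_{\alpha\gamma}\psi_{\beta\delta}Z_\gamma Z_\delta = \sum_{\alpha=n+1}^{n+m} Z_\alpha^2.$

Because the transformation is orthogonal, the $\{Z_\alpha\}$ are independently and normally distributed with common variance $\sigma^2$. Since the $Z_\alpha$, $\alpha = 1, \ldots, n$, must be linear combinations of $\sum_\alpha \gamma_{i\alpha}Y_\alpha$ and since $Z_\alpha$, $\alpha = n + 1, \ldots, n + m$, must be linear combinations of $\sum_\alpha \phi_{g\alpha}Y_\alpha$, they must have means zero (under the null hypothesis). Thus $a/\sigma^2$ and $b/\sigma^2$ have the stated independent $\chi^2$-distributions.

In the multivariate case the transformation $Y_\beta = \sum_\gamma \psi_{\beta\gamma}Z_\gamma$ is used, where $Y_\beta$ and $Z_\gamma$ are vectors. Then

(29) $A = \sum_{\alpha,\beta,\gamma,\delta} d_{\alpha\beta}\psi_{\alpha\gamma}\psi_{\beta\delta}Z_\gamma Z_\delta' = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha',$

$\quad B = \sum_{\alpha,\beta,\gamma,\delta} c_{\alpha\beta}\psi_{\alpha\gamma}\psi_{\beta\delta}Z_\gamma Z_\delta' = \sum_{\alpha=n+1}^{n+m} Z_\alpha Z_\alpha',$

because it follows from (28) that $\sum_{\alpha,\beta} d_{\alpha\beta}\psi_{\alpha\gamma}\psi_{\beta\delta} = 1$, $\gamma = \delta \le n$, and $= 0$ otherwise, and $\sum_{\alpha,\beta} c_{\alpha\beta}\psi_{\alpha\gamma}\psi_{\beta\delta} = 1$, $n + 1 \le \gamma = \delta \le n + m$, and $= 0$ otherwise. Since Ψ is orthogonal, the $\{Z_\alpha\}$ are independently normally distributed with covariance matrix Σ. The same argument shows $\mathcal{E}Z_\alpha = 0$, $\alpha = 1, \ldots, n + m$, under the null hypothesis. Thus A and B are independently distributed according to $W(\Sigma, n)$ and $W(\Sigma, m)$, respectively.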


8.10. SOME OPTIMAL PROPERTIES OF TESTS 


8.10.1. Admissibility of Invariant Tests 


In this chapter we have considered several tests of a linear hypothesis which 
are invariant with respect to transformations that leave the null hypothesis 
invariant. We raise the question of which invariant tests are good tests. In
particular we ask for admissible procedures, that is, procedures that cannot
be improved on in the sense of smaller probabilities of Type I and/or Type
II error. The competing tests are not necessarily invariant. Clearly, if an
invariant test is admissible in the class of all tests, it is admissible in the class 
of invariant tests. 

Testing the general linear hypothesis as treated here is a generalization of testing the hypothesis concerning one mean vector as treated in Chapter 5. The invariant procedures in Chapter 8 are generalizations of the $T^2$-test. One way of showing a procedure is admissible is to display a prior distribution on the parameters such that the Bayes procedure is a given test procedure. This approach requires some ingenuity in constructing the prior, but the verification of the property given the prior is straightforward. Problems 8.26 and 8.27 show that the Bartlett-Nanda-Pillai trace criterion V and Wilks's likelihood ratio criterion U yield admissible tests. The disadvantage
of this approach to admissibility is that one must invent a prior distribution 
for each procedure; a general theorem does not cover many cases. 

The other approach to admissibility is to apply Stein’s theorem (Theorem 
5.6.5), which yields general results. The invariant tests can be stated in terms 
of the roots of the determinantal equation 


(1) $|H - \lambda(H + G)| = 0,$

where $H = \hat{B}_1A_{11\cdot 2}\hat{B}_1' = W_1W_1'$ and $G = N\hat{\Sigma}_\Omega = W_3W_3'$. There is also a matrix $\hat{B}_2$ (or $W_2$) associated with the nuisance parameters $B_2$. For convenience, we define the canonical form in the following notation. Let $W_1 = X$ (p × m), $W_2 = Y$ (p × r), $W_3 = Z$ (p × n), $\mathcal{E}X = \Xi$, $\mathcal{E}Y = \mathrm{H}$, and $\mathcal{E}Z = 0$; the columns are independently normally distributed with covariance matrix Σ. The null hypothesis is $\Xi = 0$, and the alternative hypothesis is $\Xi \ne 0$.

The usual tests are given in terms of the (nonzero) roots of

(2) $|XX' - \lambda(ZZ' + XX')| = |XX' - \lambda(U - YY')| = 0,$

where $U = XX' + YY' + ZZ'$. Except for roots that are identically zero, the roots of (2) coincide with the nonzero characteristic roots of $X'(U - YY')^{-1}X$. Let $V = (X, Y, U)$ and

(3) $M(V) = X'(U - YY')^{-1}X.$

The vector of ordered characteristic roots of M(V) is denoted by

(4) $(\lambda_1, \ldots, \lambda_m)' = \lambda(M(V)),$

where $\lambda_1 \ge \cdots \ge \lambda_m \ge 0$. Since the inclusion of zero roots (when m > p) causes no trouble in the sequel, we assume that the tests depend on $\lambda(M(V))$.

The admissibility of these tests can be stated in terms of the geometric 
characteristics of the acceptance regions. Let 

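(5) $R^m_{\ge} = \{\lambda \in R^m \mid \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m \ge 0\},$

$\quad R^m_{+} = \{\lambda \in R^m \mid \lambda_1 \ge 0, \ldots, \lambda_m \ge 0\}.$

It seems reasonable that if a set of sample roots leads to acceptance of the null hypothesis, then a set of smaller roots would as well (Figure 8.2).


Definition 8.10.1. A region $A \subset R^m_{\ge}$ is monotone if $\lambda \in A$, $\nu \in R^m_{\ge}$, and $\nu_i \le \lambda_i$, $i = 1, \ldots, m$, imply $\nu \in A$.


Definition 8.10.2. For $A \subset R^m_{\ge}$ the extended region $A^*$ is

(6) $A^* = \bigcup_\pi \{(x_{\pi(1)}, \ldots, x_{\pi(m)})' \mid x \in A\},$

where π ranges over all permutations of (1, ..., m).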


Figure 8,2, A monotone acceptance region, 


The main result, first proved by Schwartz (1967), is the following theorem: 
Theorem 8.10.1. If the region $A \subset R^m_{\ge}$ is monotone and if the extended region $A^*$ is closed and convex, then A is the acceptance region of an admissible test.


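Another characterization of admissible tests is given in terms of majorization.


Definition 8.10.3. A vector $\lambda = (\lambda_1, \ldots, \lambda_m)'$ weakly majorizes a vector $\nu = (\nu_1, \ldots, \nu_m)'$ if

(7) $\lambda_{[1]} \ge \nu_{[1]}, \quad \lambda_{[1]} + \lambda_{[2]} \ge \nu_{[1]} + \nu_{[2]}, \quad \ldots, \quad \lambda_{[1]} + \cdots + \lambda_{[m]} \ge \nu_{[1]} + \cdots + \nu_{[m]},$

where $\lambda_{[i]}$ and $\nu_{[i]}$, $i = 1, \ldots, m$, are the coordinates rearranged in nonascending order.


We use the notation $\lambda \succ_w \nu$ or $\nu \prec_w \lambda$ if λ weakly majorizes ν. If $\lambda, \nu \in R^m_{\ge}$, then $\lambda \succ_w \nu$ is simply

(8) $\lambda_1 \ge \nu_1, \quad \lambda_1 + \lambda_2 \ge \nu_1 + \nu_2, \quad \ldots, \quad \lambda_1 + \cdots + \lambda_m \ge \nu_1 + \cdots + \nu_m.$

If the last inequality in (7) is replaced by an equality, we say simply that λ majorizes ν and denote this by $\lambda \succ \nu$ or $\nu \prec \lambda$. The theory of majorization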

and the related inequalities are developed in detail in Marshall and Olkin 
(1979). 
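Definition 8.10.3 translates directly into a comparison of partial sums of the ordered coordinates. The following minimal sketch checks weak majorization and majorization for two vectors of equal length; the function names and the small tolerance for floating-point error are choices made here, not part of the definition.

```python
def weakly_majorizes(lam, nu, tol=1e-12):
    """True if lam weakly majorizes nu in the sense of (7)."""
    if len(lam) != len(nu):
        raise ValueError("vectors must have the same length")
    lam_sorted = sorted(lam, reverse=True)  # nonascending order
    nu_sorted = sorted(nu, reverse=True)
    lam_sum = nu_sum = 0.0
    for l_k, v_k in zip(lam_sorted, nu_sorted):
        lam_sum += l_k
        nu_sum += v_k
        if lam_sum < nu_sum - tol:
            return False
    return True


def majorizes(lam, nu, tol=1e-12):
    """True if lam majorizes nu: (7) holds with equality in the last sum."""
    return weakly_majorizes(lam, nu, tol) and abs(sum(lam) - sum(nu)) <= tol


print(weakly_majorizes([0.6, 0.3], [0.5, 0.4]))  # partial sums 0.6, 0.9 vs 0.5, 0.9
print(majorizes([0.6, 0.3], [0.2, 0.1]))         # total sums differ
```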


Definition 8.10.4. A region $A \subset R^m_{\ge}$ is monotone in majorization if $\lambda \in A$, $\nu \in R^m_{\ge}$, $\nu \prec_w \lambda$ imply $\nu \in A$. (See Figure 8.3.)


Theorem 8.10.2. If a region $A \subset R^m_{\ge}$ is closed, convex, and monotone in majorization, then A is the acceptance region of an admissible test.


Figure 8.3. A region monotone in majorization.




Theorems 8.10.1 and 8.10.2 are equivalent; it will be convenient to prove 
Theorem 8.10.2 first. Then an argument about the extreme points of a 
certain convex set (Lemma 8.10.11) establishes the equivalence of the two 
theorems. 

Theorem 5.6.5 (Stein's theorem) will be used because we can write the distribution of (X, Y, Z) in exponential form. Let $U = XX' + YY' + ZZ' = (u_{ij})$ and $\Sigma^{-1} = (\sigma^{ij})$. For a general matrix $C = (c_1, \ldots, c_k)$, let $\operatorname{vec}(C) = (c_1', \ldots, c_k')'$. The density of (X, Y, Z) can be written as

(9) $f(X, Y, Z) = K(\Xi, \mathrm{H}, \Sigma)\exp\{\operatorname{tr}\Xi'\Sigma^{-1}X + \operatorname{tr}\mathrm{H}'\Sigma^{-1}Y - \tfrac{1}{2}\operatorname{tr}\Sigma^{-1}U\}$

$\qquad = K(\Xi, \mathrm{H}, \Sigma)\exp\{\omega_{(1)}'y_{(1)} + \omega_{(2)}'y_{(2)} + \omega_{(3)}'y_{(3)}\},$

where $K(\Xi, \mathrm{H}, \Sigma)$ is a constant,

(10) $\omega_{(1)} = \operatorname{vec}(\Sigma^{-1}\Xi), \qquad \omega_{(2)} = \operatorname{vec}(\Sigma^{-1}\mathrm{H}),$

$\quad \omega_{(3)} = -\tfrac{1}{2}(\sigma^{11}, 2\sigma^{12}, \ldots, 2\sigma^{1p}, \sigma^{22}, \ldots, \sigma^{pp})',$

$\quad y_{(1)} = \operatorname{vec}(X), \qquad y_{(2)} = \operatorname{vec}(Y),$

$\quad y_{(3)} = (u_{11}, u_{12}, \ldots, u_{1p}, u_{22}, \ldots, u_{pp})'.$

If we denote the mapping $(X, Y, Z) \to y = (y_{(1)}', y_{(2)}', y_{(3)}')'$ by g, $y = g(X, Y, Z)$, then the measure of a set A in the space of y is $m(A) = \mu(g^{-1}(A))$, where μ is the ordinary Lebesgue measure on $R^{p(m+r+n)}$. We note that (X, Y, U) is a sufficient statistic and so is $y = (y_{(1)}', y_{(2)}', y_{(3)}')'$. Because a test that is admissible with respect to the class of tests based on a sufficient statistic is admissible in the whole class of tests, we consider only tests based on a sufficient statistic. Then the acceptance regions of these tests are subsets in the space of y. The density of y given by the right-hand side of (9) is of the form of the exponential family, and therefore we can apply Stein's theorem. Furthermore, since the transformation $(X, Y, U) \to y$ is linear, we prove the convexity of an acceptance region of (X, Y, U). The acceptance region of an invariant test is given in terms of $\lambda(M(V)) = (\lambda_1, \ldots, \lambda_m)'$. Therefore, in order to prove the admissibility of these tests we have to check that the inverse image of A, namely, $\tilde{A} = \{V \mid \lambda(M(V)) \in A\}$, satisfies the conditions of Stein's theorem, namely, is convex.

Suppose $V_i = (X_i, Y_i, U_i) \in \tilde{A}$, $i = 1, 2$, that is, $\lambda[M(V_i)] \in A$. By the convexity of A, $p\lambda[M(V_1)] + q\lambda[M(V_2)] \in A$ for $0 \le p = 1 - q \le 1$. To show $pV_1 + qV_2 \in \tilde{A}$, that is, $\lambda[M(pV_1 + qV_2)] \in A$, we use the property of monotonicity of majorization of A and the following theorem.


Figure 8.4. Theorem 8.10.3.


Theorem 8.10.3.

(11) $\lambda[M(pV_1 + qV_2)] \prec_w p\lambda[M(V_1)] + q\lambda[M(V_2)].$


The proof of Theorem 8.10.3 (Figure 8.4) follows from the pair of majorizations

(12) $\lambda[M(pV_1 + qV_2)] \prec_w \lambda[pM(V_1) + qM(V_2)]$

$\qquad \prec_w p\lambda[M(V_1)] + q\lambda[M(V_2)].$

The second majorization in (12) is a special case of the following lemma.


Lemma 8.10.1. For A and B symmetric,

(13) $\lambda(A + B) \prec_w \lambda(A) + \lambda(B).$


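Proof. By Corollary A.4.2 of the Appendix,

(14) $\sum_{i=1}^{k} \lambda_i(A + B) = \max_{R'R = I_k} \operatorname{tr}R'(A + B)R$

$\qquad \le \max_{R'R = I_k} \operatorname{tr}R'AR + \max_{R'R = I_k} \operatorname{tr}R'BR$

$\qquad = \sum_{i=1}^{k} \lambda_i(A) + \sum_{i=1}^{k} \lambda_i(B)$

$\qquad = \sum_{i=1}^{k} \{\lambda_i(A) + \lambda_i(B)\}, \qquad k = 1, \ldots, p.$ ∎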
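Lemma 8.10.1 can be illustrated numerically in the 2 × 2 case, where the characteristic roots of a symmetric matrix have a closed form. The sketch below draws random symmetric matrices and checks the partial-sum inequalities of (14) for k = 1 and k = 2; the helper names and the random construction are illustrative assumptions.

```python
import math
import random


def eig_sym2(m):
    """Characteristic roots (largest first) of a symmetric 2 x 2 matrix."""
    a, b, c = m[0][0], m[0][1], m[1][1]
    mid = (a + c) / 2.0
    rad = math.hypot((a - c) / 2.0, b)
    return [mid + rad, mid - rad]


def random_symmetric(rng):
    a, b, c = (rng.uniform(-1.0, 1.0) for _ in range(3))
    return [[a, b], [b, c]]


rng = random.Random(0)
violations = 0
for _ in range(1000):
    A = random_symmetric(rng)
    B = random_symmetric(rng)
    S = [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]
    lam_sum = eig_sym2(S)
    bound = [x + y for x, y in zip(eig_sym2(A), eig_sym2(B))]
    # Partial sums for k = 1 and k = 2, as in (14).
    if lam_sum[0] > bound[0] + 1e-12:
        violations += 1
    if lam_sum[0] + lam_sum[1] > bound[0] + bound[1] + 1e-12:
        violations += 1

print(violations)
```

For k = 2 the two sides are both equal to the trace of A + B, so the inequality holds with equality up to rounding.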




Let $A > B$ mean $A - B$ is positive definite and $A \ge B$ mean $A - B$ is positive semidefinite.

The first majorization in (12) follows from several lemmas.

Lemma 8.10.2.

(15) $pU_1 + qU_2 - (pY_1 + qY_2)(pY_1 + qY_2)' \ge p(U_1 - Y_1Y_1') + q(U_2 - Y_2Y_2').$

Proof. The left-hand side minus the right-hand side is

(16) $pY_1Y_1' + qY_2Y_2' - p^2Y_1Y_1' - q^2Y_2Y_2' - pq(Y_1Y_2' + Y_2Y_1')$

$\qquad = p(1 - p)Y_1Y_1' + q(1 - q)Y_2Y_2' - pq(Y_1Y_2' + Y_2Y_1')$

$\qquad = pq(Y_1 - Y_2)(Y_1 - Y_2)' \ge 0.$ ∎


Lemma 8.10.3. If $A \ge B > 0$, then $A^{-1} \le B^{-1}$.

Proof. See Problem 8.31. ∎

Lemma 8.10.4. If $A > 0$, then $f(x, A) = x'A^{-1}x$ is convex in (x, A).

Proof. See Problem 5.17. ∎

Lemma 8.10.5. If $A_1 > 0$, $A_2 > 0$, then

(17) $(pB_1 + qB_2)'(pA_1 + qA_2)^{-1}(pB_1 + qB_2) \le pB_1'A_1^{-1}B_1 + qB_2'A_2^{-1}B_2.$

Proof. From Lemma 8.10.4 we have for all y

(18) $py'B_1'A_1^{-1}B_1y + qy'B_2'A_2^{-1}B_2y - y'(pB_1 + qB_2)'(pA_1 + qA_2)^{-1}(pB_1 + qB_2)y$

$\qquad = p(B_1y)'A_1^{-1}(B_1y) + q(B_2y)'A_2^{-1}(B_2y) - (pB_1y + qB_2y)'(pA_1 + qA_2)^{-1}(pB_1y + qB_2y)$

$\qquad \ge 0.$

Thus the matrix of the quadratic form in y is positive semidefinite. ∎


The relation as in (17) is sometimes called matrix convexity. [See Marshall and Olkin (1979).]




Lemma 8.10.6.

(19) $M(pV_1 + qV_2) \le pM(V_1) + qM(V_2),$

where $0 \le p = 1 - q \le 1$.


Proof. Lemmas 8.10.2 and 8.10.3 show that

(20) $[pU_1 + qU_2 - (pY_1 + qY_2)(pY_1 + qY_2)']^{-1} \le [p(U_1 - Y_1Y_1') + q(U_2 - Y_2Y_2')]^{-1}.$

This implies

(21) $M(pV_1 + qV_2) \le (pX_1 + qX_2)'[p(U_1 - Y_1Y_1') + q(U_2 - Y_2Y_2')]^{-1}(pX_1 + qX_2).$

Then Lemma 8.10.5 implies that the right-hand side of (21) is less than or equal to

(22) $pX_1'(U_1 - Y_1Y_1')^{-1}X_1 + qX_2'(U_2 - Y_2Y_2')^{-1}X_2 = pM(V_1) + qM(V_2).$ ∎

Lemma 8.10.7. If $A \le B$, then $\lambda(A) \prec_w \lambda(B)$.

Proof. From Corollary A.4.2 of the Appendix,

(23) $\sum_{i=1}^{k} \lambda_i(A) = \max_{R'R = I_k} \operatorname{tr}R'AR \le \max_{R'R = I_k} \operatorname{tr}R'BR = \sum_{i=1}^{k} \lambda_i(B), \qquad k = 1, \ldots, p.$ ∎


From Lemma 8.10.7 we obtain the first majorization in (12) and hence Theorem 8.10.3, which in turn implies the convexity of $\tilde{A}$. Thus the acceptance region satisfies condition (i) of Stein's theorem.

Lemma 8.10.8. For the acceptance region A of Theorem 8.10.1 or Theorem 
8.10.2, condition (ii) of Stein’s theorem is satisfied. 


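Proof. Let ω correspond to (Φ, Ψ, Θ); then

(24) $\omega'y = \omega_{(1)}'y_{(1)} + \omega_{(2)}'y_{(2)} + \omega_{(3)}'y_{(3)}$

$\qquad = \operatorname{tr}\Phi'X + \operatorname{tr}\Psi'Y - \tfrac{1}{2}\operatorname{tr}\Theta U,$

where Θ is symmetric. Suppose that $\{y \mid \omega'y > c\}$ is disjoint from $\tilde{A} = \{V \mid \lambda(M(V)) \in A\}$. We want to show that in this case Θ is positive semidefinite. If this were not true, then

(25) $\Theta = D\begin{pmatrix} I & 0 & 0 \\ 0 & -I & 0 \\ 0 & 0 & 0 \end{pmatrix}D',$

where D is nonsingular and $-I$ is not vacuous. Let $X = (1/\gamma)X_0$, $Y = (1/\gamma)Y_0$,

(26) $U = (D')^{-1}\begin{pmatrix} I & 0 & 0 \\ 0 & \gamma I & 0 \\ 0 & 0 & I \end{pmatrix}D^{-1},$

and $V = (X, Y, U)$, where $X_0$, $Y_0$ are fixed matrices and γ is a positive number. Then

(27) $\omega'y = \dfrac{1}{\gamma}\operatorname{tr}\Phi'X_0 + \dfrac{1}{\gamma}\operatorname{tr}\Psi'Y_0 + \dfrac{1}{2}\operatorname{tr}\begin{pmatrix} -I & 0 & 0 \\ 0 & \gamma I & 0 \\ 0 & 0 & 0 \end{pmatrix} > c$

for sufficiently large γ. On the other hand,

(28) $\lambda(M(V)) = \lambda\{X'(U - YY')^{-1}X\}$

$\qquad = \dfrac{1}{\gamma^2}\lambda\left\{X_0'\left[(D')^{-1}\begin{pmatrix} I & 0 & 0 \\ 0 & \gamma I & 0 \\ 0 & 0 & I \end{pmatrix}D^{-1} - \dfrac{1}{\gamma^2}Y_0Y_0'\right]^{-1}X_0\right\}$

$\qquad \to 0$

as $\gamma \to \infty$. Therefore, $V \in \tilde{A}$ for sufficiently large γ. This is a contradiction. Hence Θ is positive semidefinite.

Now let $\omega_1$ correspond to $(\Phi_1, 0, I)$, where $\Phi_1 \ne 0$. Then $I + \lambda\Theta$ is positive definite and $\Phi_1 + \lambda\Phi \ne 0$ for sufficiently large λ. Hence $\omega_1 + \lambda\omega \in \Omega - \Omega_0$ for sufficiently large λ. ∎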

The preceding proof was suggested by Charles Stein, 

By Theorem 5.6.5, Theorem 8.10.3 and Lemma 8.10.8 now imply Theorem 
8.10.2. 

To obtain Theorem 8.10.1 from Theorem 8.10.2, we use the following 
lemmas. 


Lemma 8.10.9. $A \subset R^m_{\ge}$ is convex and monotone in majorization if and only if A is monotone and $A^*$ is convex.


Figure 8.5. (Labels: extreme points; $\lambda = (\lambda_1, \lambda_2)$.)


Proof. Necessity. If A is monotone in majorization, then it is obviously monotone. $A^*$ is convex (see Problem 8.35).


Sufficiency. For $\lambda \in R^m_{\ge}$ let

(29) $C(\lambda) = \{x \mid x \in R^m_{+},\ x \prec_w \lambda\},$

$\quad D(\lambda) = \{x \mid x \in R^m_{\ge},\ x \prec_w \lambda\}.$

It will be proved in Lemma 8.10.10, Lemma 8.10.11, and its corollary that monotonicity of A and convexity of $A^*$ implies $C(\lambda) \subset A^*$. Then $D(\lambda) = C(\lambda) \cap R^m_{\ge} \subset A^* \cap R^m_{\ge} = A$. Now suppose $\nu \in R^m_{\ge}$ and $\nu \prec_w \lambda$. Then $\nu \in D(\lambda) \subset A$. This shows that A is monotone in majorization. Furthermore, if $A^*$ is convex, then $A = R^m_{\ge} \cap A^*$ is convex. (See Figure 8.5.) ∎


Lemma 8.10.10. Let C be compact and convex, and let D be convex. If the 
extreme points of C are contained in D, then C CD. 


Proof. Obvious. ∎

Lemma 8.10.11. Every extreme point of C(λ) is of the form

(30) $(\delta_{\pi(1)}\lambda_{\pi(1)}, \ldots, \delta_{\pi(m)}\lambda_{\pi(m)})',$

where π is a permutation of (1, ..., m) and $\delta_1 = \cdots = \delta_k = 1$, $\delta_{k+1} = \cdots = \delta_m = 0$ for some k.


Proof. C(λ) is convex. (See Problem 8.34.) Now note that C(λ) is permutation-symmetric, that is, if $(x_1, \ldots, x_m)' \in C(\lambda)$ then $(x_{\pi(1)}, \ldots, x_{\pi(m)})' \in C(\lambda)$ for any permutation π. Therefore, for any permutation π, $\pi(C(\lambda)) = \{(x_{\pi(1)}, \ldots, x_{\pi(m)})' \mid x \in C(\lambda)\}$ coincides with C(λ). This implies that if $(x_1, \ldots, x_m)'$ is an extreme point of C(λ), then $(x_{\pi(1)}, \ldots, x_{\pi(m)})'$ is also an extreme point. In particular, $(x_{[1]}, \ldots, x_{[m]})' \in R^m_{\ge}$ is an extreme point. Conversely, if $(x_1, \ldots, x_m)' \in R^m_{\ge}$ is an extreme point of C(λ), then $(x_{\pi(1)}, \ldots, x_{\pi(m)})'$ is an extreme point.

We see that once we enumerate the extreme points of C(λ) in $R^m_{\ge}$, the rest of the extreme points can be obtained by permutation.

Suppose $x \in R^m_{\ge}$. An extreme point, being the intersection of m hyperplanes, has to satisfy m or more of the following 2m equations:

(31) $E_1: x_1 = 0, \qquad F_1: x_1 = \lambda_1,$

$\quad E_2: x_2 = 0, \qquad F_2: x_1 + x_2 = \lambda_1 + \lambda_2,$

$\quad \vdots$

$\quad E_m: x_m = 0, \qquad F_m: x_1 + \cdots + x_m = \lambda_1 + \cdots + \lambda_m.$

Suppose that k is the first index such that $E_k$ holds. Then $x \in R^m_{\ge}$ implies $0 = x_k \ge x_{k+1} \ge \cdots \ge x_m \ge 0$. Therefore, $E_k, \ldots, E_m$ hold. The remaining $k - 1 = m - (m - k + 1)$ or more equations are among the F's. We order them as $F_{i_1}, \ldots, F_{i_l}$, where $i_1 < \cdots < i_l$, $l \ge k - 1$. Now $i_1 < \cdots < i_l$ implies $i_l \ge l$ with equality if and only if $i_1 = 1, \ldots, i_l = l$. In this case $F_1, \ldots, F_{k-1}$ hold ($l \ge k - 1$). Now suppose $i_l > l$. Since $x_k = \cdots = x_m = 0$,

(32) $F_{i_l}: x_1 + \cdots + x_{k-1} = \lambda_1 + \cdots + \lambda_{k-1} + \cdots + \lambda_{i_l}.$

But $x_1 + \cdots + x_{k-1} \le \lambda_1 + \cdots + \lambda_{k-1}$, and we have $\lambda_k + \cdots + \lambda_{i_l} = 0$. Therefore, $0 = \lambda_k + \cdots + \lambda_{i_l} \ge \lambda_k \ge \cdots \ge \lambda_m \ge 0$. In this case $F_{k-1}, \ldots, F_m$ reduce to the same equation $x_1 + \cdots + x_{k-1} = \lambda_1 + \cdots + \lambda_{k-1}$. It follows that x satisfies $k - 2$ more equations, which have to be $F_1, \ldots, F_{k-2}$. We have shown that in either case $E_k, \ldots, E_m, F_1, \ldots, F_{k-1}$ hold and this gives the point $\beta = (\lambda_1, \ldots, \lambda_{k-1}, 0, \ldots, 0)'$, which is in $R^m_{\ge} \cap C(\lambda)$. Therefore, β is an extreme point. ∎


Corollary 8.10.1. $C(\lambda) \subset A^*$.


Proof. If A is monotone, then $A^*$ is monotone in the sense that if $\lambda = (\lambda_1, \ldots, \lambda_m)' \in A^*$, $\nu = (\nu_1, \ldots, \nu_m)'$, $\nu_i \le \lambda_i$, $i = 1, \ldots, m$, then $\nu \in A^*$. (See Problem 8.35.) Now the extreme points of C(λ) given by (30) are in $A^*$ because of permutation symmetry and monotonicity of $A^*$. Hence, by Lemma 8.10.10, $C(\lambda) \subset A^*$. ∎




Proof of Theorem 8.10.1. Immediate from Theorem 8.10.2 and Lemma 8.10.9. ∎


Application of the theory of Schur-convex functions yields several corollaries to Theorem 8.10.2.

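Corollary 8.10.2. Let g be continuous, nondecreasing, and convex in [0, 1). Let

(33) $f(\lambda) = f(\lambda_1, \ldots, \lambda_m) = \sum_{i=1}^{m} g(\lambda_i).$

Then a test with the acceptance region $A = \{\lambda \mid f(\lambda) \le c\}$ is admissible.


Proof. Being a sum of convex functions f is convex, and hence A is convex. A is closed because f is continuous. We want to show that if $f(x) \le c$ and $y \prec_w x$ ($x, y \in R^m_{\ge}$), then $f(y) \le c$. Let $\tilde{x}_k = \sum_{i=1}^{k} x_i$, $\tilde{y}_k = \sum_{i=1}^{k} y_i$. Then $y \prec_w x$ if and only if $\tilde{x}_k \ge \tilde{y}_k$, $k = 1, \ldots, m$. Let $f(x) = h(\tilde{x}_1, \ldots, \tilde{x}_m) = g(\tilde{x}_1) + \sum_{i=2}^{m} g(\tilde{x}_i - \tilde{x}_{i-1})$. It suffices to show that $h(\tilde{x}_1, \ldots, \tilde{x}_m)$ is increasing in each $\tilde{x}_i$. For $i \le m - 1$ the convexity of g implies that

(34) $h(\tilde{x}_1, \ldots, \tilde{x}_i + \varepsilon, \ldots, \tilde{x}_m) - h(\tilde{x}_1, \ldots, \tilde{x}_m)$

$\qquad = g(x_i + \varepsilon) - g(x_i) - \{g(x_{i+1}) - g(x_{i+1} - \varepsilon)\} \ge 0.$

For i = m the monotonicity of g implies

(35) $h(\tilde{x}_1, \ldots, \tilde{x}_m + \varepsilon) - h(\tilde{x}_1, \ldots, \tilde{x}_m) = g(x_m + \varepsilon) - g(x_m) \ge 0.$ ∎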


Setting $g(\lambda) = -\log(1-\lambda)$, $g(\lambda) = \lambda/(1-\lambda)$, $g(\lambda) = \lambda$, respectively, shows that Wilks' likelihood ratio test, the Lawley-Hotelling trace test, and the Bartlett-Nanda-Pillai test are admissible. Admissibility of Roy's maximum root test $A: \lambda_1 \le c$ follows directly from Theorem 8.10.1 or Theorem 8.10.2. On the contrary, the minimum root test, $\lambda_t \le c$, where $t = \min(m, p)$, does not satisfy the convexity condition. The following theorem shows that this test is actually inadmissible.
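
As an informal numerical illustration of Corollary 8.10.2 (not part of the proof), the sketch below evaluates the three choices of $g$ on a grid in $[0, 0.99]$ and checks that first and second differences are nonnegative. The helper names `is_nondecreasing_and_convex` and `criterion` and the grid are assumptions made for this example only.

```python
import math

# Three choices of g from the text, each defined for 0 <= lam < 1.
G_FUNCS = {
    "Wilks likelihood ratio": lambda lam: -math.log(1.0 - lam),
    "Lawley-Hotelling trace": lambda lam: lam / (1.0 - lam),
    "Bartlett-Nanda-Pillai trace": lambda lam: lam,
}

def is_nondecreasing_and_convex(g, grid):
    """Check first and second differences of g on an equally spaced grid."""
    vals = [g(x) for x in grid]
    first = [b - a for a, b in zip(vals, vals[1:])]
    second = [b - a for a, b in zip(first, first[1:])]
    tol = 1e-12  # allow for rounding error in the linear case
    return all(d >= -tol for d in first) and all(d >= -tol for d in second)

def criterion(g, roots):
    """f(lambda) = sum_i g(lambda_i), as in (33)."""
    return sum(g(lam) for lam in roots)

grid = [i / 200.0 for i in range(0, 199)]  # 0, 0.005, ..., 0.99
checks = {name: is_nondecreasing_and_convex(g, grid) for name, g in G_FUNCS.items()}
```

A grid check of this kind only suggests monotonicity and convexity; the analytic properties of the three functions are what the corollary actually uses.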


Theorem 8.10.4. A necessary condition for an invariant test to be admissible is that the extended region in the space of $\sqrt{\lambda_1},\dots,\sqrt{\lambda_t}$ is convex and monotone.


We shall only sketch the proof of this theorem [following Schwartz (1967)]. Let $\lambda_i = d_i^2$, $i = 1,\dots,t$, and let the density of $d_1,\dots,d_t$ be $f(d \mid \nu)$, where $\nu = (\nu_1,\dots,\nu_t)'$ is defined in Section 8.6.5 and $f(d \mid \nu)$ is given in Chapter 13.


364 TESTING THE GENERAL LINEAR HYPOTHESIS; MANOVA 


The ratio $f(d \mid \nu)/f(d \mid 0)$ can be extended symmetrically to the unit cube ($0 < d_i < 1$, $i = 1,\dots,t$). The extended ratio is then a convex function and is strictly increasing in each $d_i$. A proper Bayes procedure has an acceptance region

(36) $\int \frac{f(d \mid \nu)}{f(d \mid 0)}\, d\Pi(\nu) \le c$,

where $\Pi(\nu)$ is a finite measure on the space of $\nu$'s. Then the symmetric extension of the set of $d$ satisfying (36) is convex and monotone [as shown by Birnbaum (1955)]. The closure (in the weak* topology) of the set of Bayes procedures forms an essentially complete class [Wald (1950)]. In this case the limit of the convex monotone acceptance regions is convex and monotone. The exposition of admissibility here was developed by Anderson and Takemura (1982).


8.10.2. Unbiasedness of Tests and Monotonicity of Power Functions 


A test T is called unbiased if the power achieves its minimum at the null
hypothesis. When there is a natural parametrization and a notion of distance 
in the parameter space, the power function is monotone if the power 
increases as the distance between the alternative hypothesis and the null 
hypothesis increases. Note that monotonicity implies unbiasedness. In this 
section we shall show that the power functions of many of the invariant tests 
of the general linear hypothesis are monotone in the invariants of the 
parameters, namely, the roots; these can be considered as measures of 
distance. 

To introduce the approach, we consider the acceptance interval $(-a, a)$ for testing the null hypothesis $\mu = 0$ against the alternative $\mu \ne 0$ on the basis of an observation from $N(\mu, \sigma^2)$. In Figure 8.6 the probabilities of acceptance are represented by the shaded regions for three values of $\mu$. It is clear that the probability of acceptance decreases monotonically (or equivalently the power increases monotonically) as $\mu$ moves away from zero. In fact, this property depends only on the density function being unimodal and symmetric.
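
The univariate statement can be checked directly. The following sketch, which assumes $\sigma = 1$ and an illustrative cutoff $a = 1.96$, computes the probability of acceptance $\Pr\{-a < X < a\}$ for several values of $\mu$; the function names are chosen for this example.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def acceptance_probability(mu, a, sigma=1.0):
    """Pr{-a < X < a} for X distributed as N(mu, sigma^2)."""
    return normal_cdf((a - mu) / sigma) - normal_cdf((-a - mu) / sigma)

a = 1.96                      # illustrative cutoff (assumption)
mus = [0.0, 0.5, 1.0, 2.0, 3.0]
probs = [acceptance_probability(mu, a) for mu in mus]
powers = [1.0 - p for p in probs]   # power = 1 - probability of acceptance
```

The list `probs` decreases as $\mu$ moves away from zero, so `powers` increases, and replacing $\mu$ by $-\mu$ leaves each value unchanged.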




Figure 8.6. Three probabilities of acceptance. 




Figure 8.7. Acceptance regions. 


In higher dimensions we generalize the interval by a symmetric convex set, and we ask that the density function be symmetric and unimodal in the sense that every contour of constant density surrounds a convex set. In Figure 8.7 we illustrate that in this case the probability of acceptance decreases monotonically. The following theorem is due to Anderson (1955b).

Theorem 8.10.5. Let $E$ be a convex set in $n$-space, symmetric about the origin. Let $f(x) \ge 0$ be a function such that (i) $f(x) = f(-x)$, (ii) $\{x \mid f(x) \ge u\} = K_u$ is convex for every $u$ ($0 < u < \infty$), and (iii) $\int_E f(x)\,dx < \infty$. Then

(37) $\int_E f(x + ky)\,dx \ge \int_E f(x + y)\,dx$

for $0 \le k \le 1$.
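
A rough numerical check of (37) in two dimensions is sketched below. It assumes $E$ is a rectangle centered at the origin and $f$ is a bivariate normal density with correlation 0.6 (symmetric, with elliptical and hence convex level sets); the integral is approximated by a midpoint rule. The function name, grid size, and the particular $y$ are illustrative assumptions.

```python
import numpy as np

def shifted_mass(k, y, cov, half_widths, n_grid=400):
    """Midpoint-rule approximation of the integral over E of f(x + k*y),
    with E the rectangle |x_i| <= half_widths[i] and f the N(0, cov) density."""
    h1, h2 = half_widths
    # Midpoints of an n_grid x n_grid partition of E.
    x1 = (np.arange(n_grid) + 0.5) / n_grid * 2 * h1 - h1
    x2 = (np.arange(n_grid) + 0.5) / n_grid * 2 * h2 - h2
    g1, g2 = np.meshgrid(x1, x2, indexing="ij")
    pts = np.stack([g1 + k * y[0], g2 + k * y[1]], axis=-1)
    prec = np.linalg.inv(cov)
    quad = np.einsum("...i,ij,...j->...", pts, prec, pts)
    dens = np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
    cell = (2 * h1 / n_grid) * (2 * h2 / n_grid)
    return float(dens.sum() * cell)

cov = np.array([[1.0, 0.6], [0.6, 1.0]])
y = np.array([1.5, -1.0])
ks = [0.0, 0.25, 0.5, 0.75, 1.0]
masses = [shifted_mass(k, y, cov, half_widths=(1.0, 0.5)) for k in ks]
```

Under these assumptions `masses` should be nonincreasing in `k`, in agreement with the theorem; the check is only as accurate as the quadrature.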

The proof of Theorem 8.10.5 is based on the following lemma. 

Lemma 8.10.12. Let $E$, $F$ be convex and symmetric about the origin. Then

(38) $V\{(E + ky) \cap F\} \ge V\{(E + y) \cap F\}$,

where $0 \le k \le 1$ and $V$ denotes the $n$-dimensional volume.

Proof. Consider the set $\alpha(E + y) + (1-\alpha)(E - y) = \alpha E + (1-\alpha)E + (2\alpha - 1)y$, which consists of points $\alpha(x + y) + (1-\alpha)(z - y)$ with $x, z \in E$. Let $\alpha_0 = (k+1)/2$, so that $2\alpha_0 - 1 = k$. Then by convexity of $E$ we have

(39) $\alpha_0(E + y) + (1-\alpha_0)(E - y) \subset E + ky$.

Hence by convexity of $F$

$\alpha_0[(E + y) \cap F] + (1-\alpha_0)[(E - y) \cap F] \subset (E + ky) \cap F$

and

(40) $V\{\alpha_0[(E + y) \cap F] + (1-\alpha_0)[(E - y) \cap F]\} \le V\{(E + ky) \cap F\}$.

Now by the Brunn-Minkowski inequality [e.g., Bonnesen and Fenchel (1948), Section 48], we have

(41)
$$\begin{aligned}
V^{1/n}\{\alpha_0[(E + y) \cap F] + (1-\alpha_0)[(E - y) \cap F]\}
&\ge \alpha_0 V^{1/n}\{(E + y) \cap F\} + (1-\alpha_0) V^{1/n}\{(E - y) \cap F\}\\
&= \alpha_0 V^{1/n}\{(E + y) \cap F\} + (1-\alpha_0) V^{1/n}\{(-E + y) \cap (-F)\}\\
&= V^{1/n}\{(E + y) \cap F\}.
\end{aligned}$$

The last equality follows from the symmetry of $E$ and $F$. ■


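Proof of Theorem 8.10.5. Let

(42) $H(u) = V\{(E + ky) \cap K_u\}$,

(43) $H^*(u) = V\{(E + y) \cap K_u\}$.

Then, writing $I\{\cdot\}$ for the indicator function,

(44)
$$\begin{aligned}
\int_E f(x + ky)\,dx &= \int_{E+ky} f(x)\,dx\\
&= \int_{E+ky} \int_0^{\infty} I\{u \le f(x)\}\,du\,dx\\
&= \int_0^{\infty} \int_{E+ky} I\{f(x) \ge u\}\,dx\,du\\
&= \int_0^{\infty} H(u)\,du.
\end{aligned}$$

Similarly,

(45) $\int_E f(x + y)\,dx = \int_0^{\infty} H^*(u)\,du$.

By Lemma 8.10.12, $H(u) \ge H^*(u)$. Hence Theorem 8.10.5 follows from (44) and (45). ■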


We start with the canonical form given in Section 8.10.1. We further simplify the problem as follows. Let $t = \min(m, p)$, and let $\nu_1,\dots,\nu_t$ ($\nu_1 \ge \nu_2 \ge \cdots \ge \nu_t$) be the nonzero characteristic roots of $\Xi'\Sigma^{-1}\Xi$, where $\Xi = \mathcal{E}X$.


Lemma 8.10.13. There exist matrices B (p X p) and F (m Xm) such that 


(46) 


where $D_\nu = \operatorname{diag}(\nu_1,\dots,\nu_t)$.


Proof. We prove this for the case $p \le m$ and $\nu_p > 0$. Other cases can be proved similarly. By Theorem A.2.2 of the Appendix there is a matrix $B$ such that

(47) $B\Sigma B' = I, \qquad B\Xi\Xi'B' = D_\nu$.

Let

(48) $F_1 = D_\nu^{-1/2} B \Xi \quad (p \times m)$.

Then

(49) $F_1 F_1' = I_p$.

Let $F' = (F_1', F_2')$ be a full $m \times m$ orthogonal matrix. Then

(50) $B\Xi F_2' = D_\nu^{1/2} F_1 F_2' = 0$

and

(51) $B\Xi F' = B\Xi(F_1', F_2') = B\Xi(\Xi'B'D_\nu^{-1/2}, F_2') = (D_\nu^{1/2}, 0)$. ■

Now let

(52) $U = BXF', \qquad V = BZ$.

Then the columns of $U$, $V$ are independently normally distributed with covariance matrix $I$ and means when $p \le m$

(53) $\mathcal{E}U = (D_\nu^{1/2}, 0), \qquad \mathcal{E}V = 0$.
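
The construction in the proof can be carried out numerically. The sketch below assumes $p \le m$ and a mean matrix $\Xi$ of full row rank; it obtains $B$ from a Cholesky factor of $\Sigma$ and an eigendecomposition (one convenient choice, not necessarily the route of Theorem A.2.2), and completes $F_1$ to an orthogonal $F$ with a QR factorization. The function name `canonical_reduction` and the random test matrices are assumptions for illustration.

```python
import numpy as np

def canonical_reduction(sigma, xi):
    """Return (B, F, nu) with B sigma B' = I and B xi F' = (diag(sqrt(nu)), 0).

    Assumes p <= m and that xi (p x m) has full row rank, the case treated
    in the proof.
    """
    p, m = xi.shape
    chol = np.linalg.cholesky(sigma)          # sigma = chol chol'
    w = np.linalg.solve(chol, xi)             # chol^{-1} xi
    nu, q = np.linalg.eigh(w @ w.T)           # eigenvalues in ascending order
    order = np.argsort(nu)[::-1]              # reorder so nu_1 >= ... >= nu_p
    nu, q = nu[order], q[:, order]
    b = q.T @ np.linalg.inv(chol)             # B sigma B' = I, B xi xi' B' = D_nu
    f1 = (b @ xi) / np.sqrt(nu)[:, None]      # F_1 = D_nu^{-1/2} B xi, as in (48)
    # Complete the rows of F_1 to an orthogonal m x m matrix: the last m - p
    # columns of a complete QR factor of F_1' span the orthogonal complement.
    q_full, _ = np.linalg.qr(f1.T, mode="complete")
    f = np.vstack([f1, q_full[:, p:].T])
    return b, f, nu

rng = np.random.default_rng(0)
p, m = 3, 5
root = rng.standard_normal((p, p))
sigma = root @ root.T + p * np.eye(p)         # an arbitrary positive definite matrix
xi = rng.standard_normal((p, m))
b, f, nu = canonical_reduction(sigma, xi)
reduced_mean = b @ xi @ f.T                   # should equal (D_nu^{1/2}, 0)
```

With these inputs, `reduced_mean` has $\sqrt{\nu_1},\dots,\sqrt{\nu_p}$ on the diagonal of its first $p$ columns and zeros elsewhere, which is the form of the mean of $U$ in (53).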




Invariant tests are given in terms of characteristic roots $l_1,\dots,l_t$ ($l_1 \ge \cdots \ge l_t$) of $U'(VV')^{-1}U$. Note that for the admissibility we used the characteristic roots $\lambda_i$ of $U'(UU' + VV')^{-1}U$ rather than $l_i = \lambda_i/(1 - \lambda_i)$. Here it is more natural to use $l_i$, which corresponds to the parameter value $\nu_i$. The following theorem is given by Das Gupta, Anderson, and Mudholkar (1964).
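
The relation $l_i = \lambda_i/(1-\lambda_i)$ between the two sets of roots can be verified on simulated matrices. In the sketch below the dimensions and the random $U$ and $V$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
p, m, n = 3, 4, 10
u = rng.standard_normal((p, m))
v = rng.standard_normal((p, n))
vv = v @ v.T
t = min(m, p)

# Nonzero roots of U'(VV')^{-1}U and of U'(UU' + VV')^{-1}U, largest first.
# Both m x m matrices are symmetric, so eigvalsh applies.
l_roots = np.sort(np.linalg.eigvalsh(u.T @ np.linalg.solve(vv, u)))[::-1][:t]
lam_roots = np.sort(
    np.linalg.eigvalsh(u.T @ np.linalg.solve(u @ u.T + vv, u))
)[::-1][:t]
```

Because $\lambda \mapsto \lambda/(1-\lambda)$ is increasing on $[0, 1)$, the ordering of the roots is preserved and the two arrays can be compared elementwise.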


Theorem 8.10.6. If the acceptance region of an invariant test is convex in the space of each column vector of $U$ for each set of fixed values of $V$ and of the other column vectors of $U$, then the power of the test increases monotonically in each $\nu_i$.


Proof. Since $UU'$ is unchanged when any column vector of $U$ is multiplied by $-1$, the acceptance region is symmetric about the origin in each of the column vectors of $U$. Now the density of $U = (u_{ij})$, $V = (v_{ij})$ is

(54) $f(U, V) = (2\pi)^{-\frac{1}{2}(n+m)p} \exp\left[-\tfrac{1}{2}\left\{\operatorname{tr} VV' + \sum_{i=1}^{t} (u_{ii} - \sqrt{\nu_i})^2 + \sum_{i \ne j} u_{ij}^2\right\}\right]$.

Applying Theorem 8.10.5 to (54), we see that the power increases monotonically in each $\sqrt{\nu_i}$. ■


Since the section of a convex set is convex, we have the following corollary.


Corollary 8.10.3. If the acceptance region $A$ of an invariant test is convex in $U$ for each fixed $V$, then the power of the test increases monotonically in each $\nu_i$.


From this we see that Roy's maximum root test $A: l_1 \le K$ and the Lawley-Hotelling trace test $A: \operatorname{tr} U'(VV')^{-1}U \le K$ have power functions that are monotonically increasing in each $\nu_i$.

To see that the acceptance region of the likelihood ratio test

(55) $A: \prod_{i=1}^{t} (1 + l_i) \le K$

satisfies the condition of Theorem 8.10.6 let

(56) $(VV')^{-1} = T'T, \qquad U^* = (u_1^*,\dots,u_m^*) = TU$.

Then

(57)
$$\begin{aligned}
\prod_{i=1}^{t} (1 + l_i) &= |U'(VV')^{-1}U + I| = |U^{*\prime}U^* + I|\\
&= |U^*U^{*\prime} + I| = |u_1^* u_1^{*\prime} + B|\\
&= (u_1^{*\prime} B^{-1} u_1^* + 1)|B|\\
&= (u_1' T' B^{-1} T u_1 + 1)|B|,
\end{aligned}$$

where $B = u_2^* u_2^{*\prime} + \cdots + u_m^* u_m^{*\prime} + I$. Since $T'B^{-1}T$ is positive definite, (55) is convex in $u_1$. Therefore, the likelihood ratio test has a power function which is monotone increasing in each $\nu_i$.
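
The chain of determinant identities in (57) can be checked numerically. The sketch assumes that $T$ is any matrix with $T'T = (VV')^{-1}$ (here the inverse of a Cholesky factor of $VV'$) and that $U^* = TU$; dimensions and random inputs are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
p, m, n = 3, 4, 9
u = rng.standard_normal((p, m))
v = rng.standard_normal((p, n))

# T with T'T = (VV')^{-1}: take T = L^{-1}, where VV' = L L' (Cholesky).
chol = np.linalg.cholesky(v @ v.T)
t_mat = np.linalg.inv(chol)
u_star = t_mat @ u

# Left side: product of (1 + l_i) over the roots of U'(VV')^{-1}U.
# Zero roots contribute factors of 1, so all m eigenvalues may be used.
roots = np.linalg.eigvalsh(u.T @ np.linalg.solve(v @ v.T, u))
lhs = float(np.prod(1.0 + roots))

# Right side: (u_1' T' B^{-1} T u_1 + 1) |B|, with B built from columns 2..m.
b = u_star[:, 1:] @ u_star[:, 1:].T + np.eye(p)
u1 = u[:, 0]
rhs = float((u1 @ t_mat.T @ np.linalg.solve(b, t_mat @ u1) + 1.0) * np.linalg.det(b))
```

For fixed $V$ and fixed remaining columns, `rhs` is a positive definite quadratic form in `u1` plus a constant, which is why the region (55) is convex in $u_1$.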


The Bartlett-Nanda-Pillai trace test

(58) $A: \operatorname{tr} U'(UU' + VV')^{-1}U \le K$

has an acceptance region that is an ellipsoid if $K < 1$ and is convex in each column $u_i$ of $U$ provided $K < 1$. (See Problem 8.36.) For $K > 1$ (58) may not be convex in each column of $U$. The reader can work out an example for $p = 2$.

Eaton and Perlman (1974) have shown that if an invariant test is convex in $U$ and $W = VV'$, then the power at $(\nu_1^0,\dots,\nu_t^0)$ is greater than at $(\nu_1,\dots,\nu_t)$ if $(\sqrt{\nu_1},\dots,\sqrt{\nu_t}) \prec_w (\sqrt{\nu_1^0},\dots,\sqrt{\nu_t^0})$. We shall not prove this result. Roy's maximum root test and the Lawley-Hotelling trace test satisfy the condition, but the likelihood ratio and the Bartlett-Nanda-Pillai trace test do not.

Takemura has shown that if the acceptance region is convex in $U$ and $W$, the set of $\sqrt{\nu_1},\dots,\sqrt{\nu_t}$ for which the power is not greater than a constant is monotone and convex.

It is enlightening to consider the contours of the power function in the space of $\sqrt{\nu_1},\dots,\sqrt{\nu_t}$. Theorem 8.10.6 does not exclude case (a) of Figure 8.8, and similarly the Eaton-Perlman result does not exclude (b). The last result guarantees that the contour looks like (c) for Roy's maximum root test and the Lawley-Hotelling trace test. These results relate to the fact that these two tests are more likely to detect alternative hypotheses where few $\nu_i$'s are far from zero. In contrast with this, the likelihood ratio test and the Bartlett-Nanda-Pillai trace test are sensitive to the overall departure from the null hypothesis. It might be noted that the convexity in $\sqrt{\nu}$-space cannot be translated into the convexity in $\nu$-space.

Figure 8.8. Contours of power functions: panels (a), (b), (c).

By using the noncentral density of the $l_i$'s, which depends on the parameter values $\nu_1,\dots,\nu_t$, Perlman and Olkin (1980) showed that any invariant test with monotone acceptance region (in the space of roots) is unbiased. Note that this result covers all the standard tests considered earlier.


8.11. ELLIPTICALLY CONTOURED DISTRIBUTIONS 


8.11.1. Observations Elliptically Contoured 


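The regression model of Section 8.2 can be written

(1) $x_\alpha = \beta z_\alpha + e_\alpha, \qquad \alpha = 1,\dots,N$,

where $e_\alpha$ is an unobserved disturbance with $\mathcal{E}e_\alpha = 0$ and $\mathcal{E}e_\alpha e_\alpha' = \Sigma$. We assume that $e_\alpha$ has a density $|\Lambda|^{-1/2} g(e'\Lambda^{-1}e)$; then $\Sigma = (\mathcal{E}R^2/p)\Lambda$, where $R^2 = e_\alpha'\Lambda^{-1}e_\alpha$. In general the exact distribution of $B = \sum_{\alpha=1}^{N} x_\alpha z_\alpha' A^{-1}$ and $N\hat{\Sigma} = \sum_{\alpha=1}^{N} (x_\alpha - Bz_\alpha)(x_\alpha - Bz_\alpha)'$ is difficult to obtain and cannot be expressed concisely. However, the expected value of $B$ is $\beta$, and the covariance matrix of $\operatorname{vec} B'$ is $\Sigma \otimes A^{-1}$ with $A = \sum_{\alpha=1}^{N} z_\alpha z_\alpha'$. We can develop a large-sample distribution for $B$ and $N\hat{\Sigma}$.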
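
For concreteness, the estimator and the residual matrix can be computed as follows. The sketch stores observations as columns ($x$ is $p \times N$, $z$ is $q \times N$), and the function name `least_squares_fit` and the test matrices are assumptions for illustration.

```python
import numpy as np

def least_squares_fit(x, z):
    """x is p x N, z is q x N. Returns (B, N * Sigma_hat) with B = C A^{-1}."""
    a = z @ z.T                      # A = sum of z z'
    c = x @ z.T                      # C = sum of x z'
    b = np.linalg.solve(a, c.T).T    # B = C A^{-1}, using the symmetry of A
    resid = x - b @ z
    return b, resid @ resid.T

rng = np.random.default_rng(3)
p, q, n_obs = 2, 3, 50
beta = np.array([[1.0, -2.0, 0.5], [0.0, 3.0, 1.5]])
z = rng.standard_normal((q, n_obs))
x_exact = beta @ z                   # no disturbance, so B should reproduce beta
b_exact, n_sigma_exact = least_squares_fit(x_exact, z)
```

With no disturbance term the fit recovers `beta` exactly (up to rounding) and the residual matrix vanishes; adding disturbances with mean zero leaves the expected value of $B$ equal to $\beta$.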


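Theorem 8.11.1. Suppose $(1/N)A \to A_0$, $z_\alpha' z_\alpha <$ constant, $\alpha = 1, 2, \dots$, and either the $e_\alpha$'s are independent identically distributed or the $e_\alpha$'s are independent with $\mathcal{E}|e_\alpha' e_\alpha|^{1+\varepsilon} <$ constant for some $\varepsilon > 0$. Then $B \xrightarrow{p} \beta$ and $\sqrt{N}\operatorname{vec}(B' - \beta')$ has a limiting normal distribution with mean 0 and covariance matrix $\Sigma \otimes A_0^{-1}$.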


Theorem 8.11.1 appears in Anderson (1971) as Theorem 5.5.13. There are many alternatives to its assumptions in the literature. Under its assumptions $\hat{\Sigma} \xrightarrow{p} \Sigma$. This result permits a large-sample theory for the criteria for testing null hypotheses about $\beta$.

Consider testing the null hypothesis

(2) $H: \beta = \beta^*$,

where $\beta^*$ is completely specified. In Section 8.3 a more general hypothesis was considered for $\beta$ partitioned as $\beta = (\beta_1, \beta_2)$. However, as shown in that section by the transformation (4), the hypothesis $\beta_1 = \beta_1^*$ can be reduced to a hypothesis of the form (2) above.

Let

(3) $G = \sum_{\alpha=1}^{N} (x_\alpha - Bz_\alpha)(x_\alpha - Bz_\alpha)' = N\hat{\Sigma}_\Omega$,

(4) $H = (B - \beta)A(B - \beta)'$.


Lemma 8.11.1. Under the conditions of Theorem 8.11.1 the limiting distribution of $H$ is $W(\Sigma, q)$.


Proof. Write $H$ as

(5) $H = \sqrt{N}(B - \beta)\,\frac{1}{N}A\,\sqrt{N}(B - \beta)'$.

Then the lemma follows from Theorem 8.11.1 and (4) of Section 8.4. ■
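We can express the likelihood ratio criterion in the form

(6) $-2\log\lambda = -N\log U = N\log|I + G^{-1}H| = N\log\left|I + \frac{1}{N}\left(\frac{1}{N}G\right)^{-1}H\right|$.


Theorem 8.11.2. Under the conditions of Theorem 8.11.1, when the null hypothesis is true,

(7) $-2\log\lambda \xrightarrow{d} \chi^2_{pq}$.


Proof. We use the fact that $N\log|I + N^{-1}C| = \operatorname{tr} C + O_p(N^{-1})$ when $N \to \infty$, since $|I + xC| = 1 + x\operatorname{tr} C + O(x^2)$ (Theorem A.4.8). We have

(8)
$$\begin{aligned}
\operatorname{tr}\left(\frac{1}{N}G\right)^{-1}H &= N\sum_{i,j=1}^{p}\sum_{g,h=1}^{q} g^{ij}(b_{ig} - \beta_{ig})\,a_{gh}\,(b_{jh} - \beta_{jh})\\
&= [\operatorname{vec}(B' - \beta')]'\left[\left(\frac{1}{N}G\right)^{-1} \otimes A\right]\operatorname{vec}(B' - \beta') \xrightarrow{d} \chi^2_{pq},
\end{aligned}$$

where $g^{ij}$ denotes the $i,j$th element of $G^{-1}$, because $(1/N)G \xrightarrow{p} \Sigma$, $(1/N)A \to A_0$, and the limiting distribution of $\sqrt{N}\operatorname{vec}(B' - \beta')$ is $N(0, \Sigma \otimes A_0^{-1})$. ■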
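
The approximation used at the start of the proof, $N\log|I + N^{-1}C| \approx \operatorname{tr} C$, can be seen numerically for a fixed matrix $C$. The matrix below and the function name `scaled_log_det` are arbitrary choices for illustration.

```python
import numpy as np

def scaled_log_det(c, n):
    """N * log|I + C/N|, computed with a numerically stable log-determinant."""
    _, logdet = np.linalg.slogdet(np.eye(c.shape[0]) + c / n)
    return n * logdet

c = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 0.5]])
errors = {n: abs(scaled_log_det(c, n) - np.trace(c)) for n in (10, 100, 1000, 10000)}
```

The error shrinks roughly in proportion to $1/N$, consistent with the leading neglected term $-\operatorname{tr} C^2/(2N)$ of the expansion.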




Theorem 8.11.2 agrees with the first term of the asymptotic expansion of $-2\log\lambda$ given by Theorem 8.5.2 for sampling from a normal distribution. The test and confidence procedures discussed in Sections 8.3 and 8.4 can be applied using this $\chi^2$-distribution.

The criterion $U = \lambda^{2/N}$ can be written as $U = \prod_{i=1}^{p} V_i$, where $V_i$ is defined in (8) of Section 8.4. The term $V_i$ has the form of $U$; that is, it is the ratio of the sum of squares of residuals of $x_{i\alpha}$ regressed on $x_{1\alpha},\dots,x_{i-1,\alpha}, z_\alpha$ to the sum regressed on $x_{1\alpha},\dots,x_{i-1,\alpha}$. It follows that under the null hypothesis $V_1,\dots,V_p$ are asymptotically independent and $-N\log V_i \xrightarrow{d} \chi^2_q$. Thus $-N\log U = -N\sum_{i=1}^{p}\log V_i \xrightarrow{d} \chi^2_{pq}$. This argument justifies the step-down procedure asymptotically.

Section 8.6 gave several other criteria for the general linear hypothesis: the Lawley-Hotelling trace $\operatorname{tr} HG^{-1}$, the Bartlett-Nanda-Pillai trace $\operatorname{tr} H(G + H)^{-1}$, and the Roy maximum root of $HG^{-1}$ or $H(G + H)^{-1}$. The limiting distributions of $N\operatorname{tr} HG^{-1}$ and $N\operatorname{tr} H(G + H)^{-1}$ are again $\chi^2_{pq}$. The limiting distribution of the maximum characteristic root of $NHG^{-1}$ or $NH(G + H)^{-1}$ is the distribution of the maximum characteristic root of $H$ having the distribution $W(I, q)$ (Lemma 8.11.1). Significance points for these test criteria are available in Appendix B.


8.11.2. Elliptically Contoured Matrix Distributions 


In Section 8.3.2 the $p \times N$ matrix of observations on the dependent variable was defined as $X = (x_1,\dots,x_N)$, and the $q \times N$ matrix of observations on the independent variables as $Z = (z_1,\dots,z_N)$; the two matrices are related by $\mathcal{E}X = \beta Z$. Note that in this chapter the matrices of observations have $N$ columns instead of $N$ rows.

Let $E = (e_1,\dots,e_N)$ be a $p \times N$ random matrix with density $|\Lambda|^{-N/2} g[F^{-1}EE'(F')^{-1}]$, where $\Lambda = FF'$. Define $X$ by

(9) $X = \beta Z + E$.

In these terms the least squares estimator of $\beta$ is

(10) $B = XZ'(ZZ')^{-1} = CA^{-1}$,

where $C = XZ' = \sum_{\alpha=1}^{N} x_\alpha z_\alpha'$ and $A = ZZ' = \sum_{\alpha=1}^{N} z_\alpha z_\alpha'$. Note that the density of $E$ is invariant with respect to multiplication on the right by $N \times N$ orthogonal matrices; that is, $E'$ is left spherical. Then $E'$ has the stochastic representation

(11) $E' \overset{d}{=} UTF'$,

where $U$ has the uniform distribution on $U'U = I_p$, $T$ is the lower triangular matrix with nonnegative diagonal elements satisfying $EE' = FT'TF'$, and $F$ is a lower triangular matrix with nonnegative diagonal elements satisfying $FF' = \Lambda$. We can write

(12) $B - \beta = EZ'A^{-1} \overset{d}{=} FT'U'Z'A^{-1}$,

(13) $H = (B - \beta)A(B - \beta)' = EZ'A^{-1}ZE' \overset{d}{=} FT'U'(Z'A^{-1}Z)UTF'$,

(14) $G = (X - \beta Z)(X - \beta Z)' - H = EE' - H = E(I_N - Z'A^{-1}Z)E' \overset{d}{=} FT'U'(I_N - Z'A^{-1}Z)UTF'$.
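
The purely algebraic parts of (12) to (14) hold for any error matrix and can be verified directly. The sketch below uses simulated normal errors only for convenience; the dimensions and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
p, q, n_obs = 2, 3, 12
beta = rng.standard_normal((p, q))
z = rng.standard_normal((q, n_obs))
e = rng.standard_normal((p, n_obs))   # any error matrix works for the algebra
x = beta @ z + e                      # model (9)

a = z @ z.T
b = x @ z.T @ np.linalg.inv(a)        # least squares estimator (10)
proj = z.T @ np.linalg.solve(a, z)    # Z'A^{-1}Z, an N x N projection matrix

h = (b - beta) @ a @ (b - beta).T     # H as in (13)
g = (x - b @ z) @ (x - b @ z).T       # residual sum of squares and products
```

Here `h` equals `e @ proj @ e.T`, `g` equals `e @ (I - proj) @ e.T`, and their sum is `e @ e.T`; `proj` is idempotent with trace $q$.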


It was shown in Section 8.6 that the likelihood ratio criterion for $H: \beta = 0$, the Lawley-Hotelling trace criterion, the Bartlett-Nanda-Pillai trace criterion, and the Roy maximum root test are invariant with respect to linear transformations $x \to Kx$. Then Corollary 4.5.5 implies the following theorem.


Theorem 8.11.3. Under the null hypothesis $\beta = 0$, the distribution of each invariant criterion when the distribution of $E'$ is left spherical is the same as the distribution under normality.


Thus the tests and confidence regions described in Section 8.7 are valid for left-spherical distributions $E'$.

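The matrices $Z'A^{-1}Z$ and $I_N - Z'A^{-1}Z$ are idempotent of ranks $q$ and $N - q$. There is an orthogonal matrix $O$ such that

(15) $OZ'A^{-1}ZO' = \begin{pmatrix} I_q & 0 \\ 0 & 0 \end{pmatrix}, \qquad O(I_N - Z'A^{-1}Z)O' = \begin{pmatrix} 0 & 0 \\ 0 & I_{N-q} \end{pmatrix}$.

The transformation $V = OU$ is uniformly distributed on $V'V = I_p$, and

(16) $H = KV'\begin{pmatrix} I_q & 0 \\ 0 & 0 \end{pmatrix}VK', \qquad G = KV'\begin{pmatrix} 0 & 0 \\ 0 & I_{N-q} \end{pmatrix}VK'$,

where $K = FT'$. The trace criterion $\operatorname{tr} HG^{-1}$, for example, is

(17) $\operatorname{tr} HG^{-1} = \operatorname{tr}\, V'\begin{pmatrix} I_q & 0 \\ 0 & 0 \end{pmatrix}V\left[V'\begin{pmatrix} 0 & 0 \\ 0 & I_{N-q} \end{pmatrix}V\right]^{-1}$.

The distribution of any invariant criterion depends only on $U$ (or $V$), not on $T$.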


Since $G + H = FT'TF'$, it is independent of $U$. A selection of a linear transformation of $X$ can be made on the basis of $G + H$. Let $D$ be a $p \times r$ matrix of rank $r$ that may depend on $G + H$. Define $x_\alpha^* = D'x_\alpha$. Then $\mathcal{E}x_\alpha^* = (D'\beta)z_\alpha$, and the hypothesis $\beta = 0$ implies $D'\beta = 0$. Let $X^* = (x_1^*,\dots,x_N^*) = D'X$, $\beta_D = D'\beta$, $E_D = D'E$, $H_D = D'HD$, $G_D = D'GD$. Then $E_D' = E'D \overset{d}{=} UTF'D$. The invariant test criteria for $\beta_D = 0$ are those for $\beta = 0$ and have the same distributions under the null hypothesis as for the normal distribution with $p$ replaced by $r$.


PROBLEMS 


8.1. (Sec. 8.2.2) Consider the following sample (for $N = 8$):

Weight of grain        40  17   9  15   6  12   5   9
Weight of straw        53  19  10  29  13  27  19  30
Amount of fertilizer   24  11   5  12   7  14  11  18

Let $z_{2\alpha} = 1$, and let $z_{1\alpha}$ be the amount of fertilizer on the $\alpha$th plot. Estimate $\beta$ for this sample. Test the hypothesis $\beta_1 = 0$ at the 0.01 significance level.

8.2. (Sec. 8.2) Show that Theorem 3.2.1 is a special case of Theorem 8.2.1. [Hint: Let $q = 1$, $z_\alpha = 1$, $\beta = \mu$.]

8.3. (Sec. 8.2) Prove Theorem 8.2.3.

8.4. (Sec. 8.2) Show that $B$ minimizes the generalized variance

$\left|\sum_{\alpha=1}^{N} (x_\alpha - \beta z_\alpha)(x_\alpha - \beta z_\alpha)'\right|$.

8.5. (Sec. 8.3) In the following data [Woltz, Reid, and Colwell (1948), used by R. L. Anderson and Bancroft (1952)] the variables are $x_1$, rate of cigarette burn; $x_2$, the percentage of nicotine; $z_1$, the percentage of nitrogen; $z_2$, of chlorine; $z_3$, of potassium; $z_4$, of phosphorus; $z_5$, of calcium; and $z_6$, of magnesium; and $z_7 = 1$; and $N = 25$:


53.92 
62.02 
N N 56.00 
ea laa yz, = 412.25 |, 
= onl 89.79 
24.10 
25 


Nv 
aye, _ ey _ (0.6690 0.4527 
LG %)(x, —¥) ie 6.5921 }? 


$$\sum_{\alpha=1}^{N} (z_\alpha - \bar{z})(z_\alpha - \bar{z})' = \begin{pmatrix}
1.8311 & -0.3589 & -0.0125 & -0.0244 & 1.6379 & 0.5057 & 0\\
-0.3589 & 8.8102 & -0.3469 & 0.0352 & 0.7920 & 0.2173 & 0\\
-0.0125 & -0.3469 & 1.5818 & -0.0415 & -1.4278 & -0.4753 & 0\\
-0.0244 & 0.0352 & -0.0415 & 0.0258 & 0.0043 & 0.0154 & 0\\
1.6379 & 0.7920 & -1.4278 & 0.0043 & 3.7248 & 0.9120 & 0\\
0.5057 & 0.2173 & -0.4753 & 0.0154 & 0.9120 & 0.3828 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix},$$

$$\sum_{\alpha=1}^{N} (z_\alpha - \bar{z})(x_\alpha - \bar{x})' = \begin{pmatrix}
0.2501 & 2.6691\\
-1.5136 & -2.0617\\
0.5007 & -0.9503\\
-0.0421 & -0.0187\\
-0.1914 & 3.4020\\
-0.1586 & 1.1663\\
0 & 0
\end{pmatrix}.$$

(a) Estimate the regression of $x_1$ and $x_2$ on $z_1$, $z_5$, $z_6$, and $z_7$.
(b) Estimate the regression on all seven variables.
(c) Test the hypothesis that the regression on $z_2$, $z_3$, and $z_4$ is 0.


8.6. (Sec. 8.3) Let $q = 2$, $z_{1\alpha} = w_\alpha$ (scalar), $z_{2\alpha} = 1$. Show that the $U$-statistic for testing the hypothesis $\beta_1 = 0$ is a monotonic function of a $T^2$-statistic, and give the $T^2$-statistic in a simple form. (See Problem 5.1.)
8.7. (Sec. 8.3) Let z,, =1, let g, = 1, and let 
A* =)" (2_ -2))( Za %,) |) ij=l,....q)=q-1. 


Prove that 


(Bin —Bi)(An ~ Ay AZ'An )(Bio ~B)'= (Bra — B, )A* (Bia —B,)'. 


8.8. (Sec. 8.3) Let g, =q@,. How do you test the hypothesis B, = B,? 


8.9. (Sec. 8.3) Prove 


2 -} 
Bio = Lxq( 2) — 4,2 Ag! 29)’ LV (2 — Ay Az 2) (2) ~Ay Ay a) 


=(C,- C,Ay'An1)(Ayy —ApAyAn) | : 


376 TESTING THE GENERAL LINEAR HYPOTHESIS; MANOVA 


8.10. (Sec. 8.4) By comparing Theorem 8.2.2 and Problem 8.9, prove Lemma 8.4.1.


8.11. (Sec. 8.4) Prove Lemma 8.4.1 by showing that the density of Bia and Bo, is 
K, exp| —3tr 2~'(Brg — BY) 41).2(Bra = Bi)’| 
“Kz exp — itr 2~' (Bou ~ Bz) Azz(Bow ~ Br)’. 


8.12, (Sec. 8.4) Show that the cdf of U,,,,,, is 
T(n+2)r[3(n+1)] 
T(n-1)P(4n-1)¥7r 


ec T—-u  ykn-l) 


~a(n=1) + aT [aresin(2u = 1) = ta] 


n Vie 3(n+1) 


[Hint; Use Theorem 8.4.4. The region {0 <z, <1,0 sz, <$1,z7z, <u} is the 
union of {0<z,< 1,0 sz, <u} and {0 <z, su/z,,u<z,< 1},] 


2ui" [43 *) 2u2""1(1 —u)? | 
+ log} ——=—_ | + — --—-— }. 


8.13. (Sec. 84) Find Pr{U,,,, > u}. 
8.14, (Sec. 8.4) Find Pr{U, 4, > ub. 


8.15. (Sec, 8.4) For p<: find »EU® from the density of G and H. [Hint: Use the 
fact that the ae of K+ Li WV is W(2,s+¢) if the density of K is 
W(Z,s) and V,,..., are independently distributed as N(0, %).] 


8.16. (Sec. 8.4) 


(a) Show that when p is even, the characteristic function of Y= log U,, m,n» Say 
Git) = ©E e’”, is the reciprocal of a polynomial. 


(b) Sketch a method of inverting the characteristic function of Y by the 
method of residues. 


(c) Show that the resulting density of U is a polynomial in vu and logu with 
possibly a factor of ué ?. 


8.17. (Sec. 8.5) Use the asymptotic expansion of the distribution to compute Pr{—k 
log U;,,,, <M!*} for 


(a) n=8 M*=14.7, 
(b) n=8 M*=21.7, 
(ec) n= 16, M* = 14.7, 
(d) n= 16, M* =21.7. 


(Either compute to the third decimal place or use the expansion to the k~4 
term.) 


8.18. (Sec. 8.5) In case $p = 3$, $q_1 = 4$, and $n = N - q = 20$, find the 50% significance point for $k\log U$ (a) using $-2\log\lambda$ as $\chi^2$ and (b) using $-k\log U$ as $\chi^2$. Using more terms of this expansion, evaluate the exact significance levels for your answers to (a) and (b).

8.19. (Sec. 8.6.5) Prove for $l_i \ge 0$, $i = 1,\dots,p$,

$\sum_{i=1}^{p} \frac{l_i}{1 + l_i} \le \log\prod_{i=1}^{p} (1 + l_i) \le \sum_{i=1}^{p} l_i$.

Comment: The inequalities imply an ordering of the values of the Bartlett-Nanda-Pillai trace, the negative logarithm of the likelihood ratio criterion, and the Lawley-Hotelling trace.

8.20. (Sec. 8.6) The multivariate beta density. Let $H$ and $G$ be independently distributed according to $W(\Sigma, m)$ and $W(\Sigma, n)$, respectively. Let $C$ be a matrix such that $CC' = H + G$, and let

$L = C^{-1}H(C')^{-1}$.

Show that the density of $L$ is

$\frac{\Gamma_p[\frac{1}{2}(m + n)]}{\Gamma_p(\frac{1}{2}m)\,\Gamma_p(\frac{1}{2}n)}\,|L|^{\frac{1}{2}(m-p-1)}\,|I - L|^{\frac{1}{2}(n-p-1)}$

for $L$ and $I - L$ positive definite, and 0 otherwise.

8.21. (Sec. 8.9) Let $Y_{ij}$ (a $p$-component vector) be distributed according to $N(\mu_{ij}, \Sigma)$, where $\mathcal{E}Y_{ij} = \mu_{ij} = \mu + \lambda_i + \nu_j + \gamma_{ij}$, $\sum_i \lambda_i = 0 = \sum_j \nu_j = \sum_i \gamma_{ij} = \sum_j \gamma_{ij}$; the $\gamma_{ij}$ are the interactions. If $m$ observations are made on each $Y_{ij}$ (say $y_{ij1},\dots,y_{ijm}$), how do you test the hypothesis $\lambda_i = 0$, $i = 1,\dots,r$? How do you test the hypothesis $\gamma_{ij} = 0$, $i = 1,\dots,r$, $j = 1,\dots,c$?

8.22. (Sec. 8.9) The Latin square. Let $Y_{ij}$, $i, j = 1,\dots,r$, be distributed according to $N(\mu_{ij}, \Sigma)$, where $\mathcal{E}Y_{ij} = \mu_{ij} = \gamma + \lambda_i + \nu_j + \mu_k$ and $k = j - i + 1 \pmod r$ with $\sum \lambda_i = \sum \nu_j = \sum \mu_k = 0$.

(a) Give the univariate analysis of variance table for main effects and error (including sums of squares, numbers of degrees of freedom, and mean squares).

(b) Give the table for the vector case.

(c) Indicate in the vector case how to test the hypothesis $\lambda_i = 0$, $i = 1,\dots,r$.

8.23. (Sec. 8.9) Let $x_1$ be the yield of a process and $x_2$ a quality measure. Let $z_1 = 1$, $z_2 = \pm 10°$ (temperature relative to average), $z_3 = \pm 0.75$ (relative measure of flow of one agent), and $z_4 = \pm 1.50$ (relative measure of flow of another agent). [See Anderson (1955a) for details.] Three observations were made on $x_1$



8.24, 


8,25. 


8.26. 


8.27, 




and $x_2$ for each possible triplet of values of $z_2$, $z_3$, and $z_4$. The estimate of $\beta$ is

$\hat{\beta} = \begin{pmatrix} 58.529 & -0.3829 & -5.050 & 2.308 \\ 98.675 & 0.1558 & 4.144 & -0.700 \end{pmatrix}$;

$s_1 = 3.090$, $s_2 = 1.619$, and $r = -0.6632$ can be used to compute $S$ or $\hat{\Sigma}$.

(a) Formulate an analysis of variance model for this situation.

(b) Find a confidence region for the effects of temperature (i.e., $\beta_{12}$, $\beta_{22}$).

(c) Test the hypothesis that the two agents have no effect on the yield and quantity.


(Sec. 8.6) Interpret the transformations referred to in Theorem 8.6.1 in the 
original terms; that is, H:B, = By and z”. 


(Sec. 8.6) Find the cdf of tr HG~! for p = 2. (Hint: Use the distribution of the 
roots given in Chapter 13.] 


(Sec. 8.10.1) Bartlett-Nanda-Pillai V-test as a Bayes procedure. Let 
WWo,---s."ma, De independently normally distributed with covariance matrix 
X and means ~Ew, = -y,,/=1,...,m, -Ew,=0,i=m-+1,...,m+n. Let Ip be 
defined by [T,, =] =[0, (+ CC’) '], where the pxm matrix C has a density 
proportional to [f+ CC’|~2"*™, and T= (y,,-+-)¥m3 let Tl, be defined by 


[Ey Z)=[Ur+ cc')~'c, +cC')- 1] where C has a density proportional iS 
|7+Cc’|~ pharm) gr CUs ll) 1c 


(a) Show that the measures are finite for 1 >p by showing tr C’(1+ CC’)"'C 
a and verifying that the integral of |¥ + CC’|~ a(n+m) _ finite. [ Hint: Let 
= (6)325 ct Pe a =E,.,d), j= n(E, =). 
ck IDI = |D,_ ave +d\d,) and hence | Dy | = Td +e oa) Then refer 
to Problem 5. 15. ] 


(b) Show that the inequality (26) of Section 5.6 is equivalent to 


mtn -l in 
a x “mi Yi ww ak, 
i=] I 


=| 
Hence the Bartlett-Nanda-—Pillai V-test is Bayes and thus admissible. 


(Sec. 8.10.1) Likelihood ratio test as a Bayes procedure, Let W,...,%n4, be 
independently normally distributed with covariance matrix & and means ae 
=, i=1,...,.m, Ew, =0, i=m+1,...,m +n, with n2m +p. Let [Ty be 

defined by [T,,¥]=[0, (+ ra Oa) ala By where the pXm matrix C has a density 
proportional to |¥ + CC’|~ 3*") and Ty =(y,,...,Y¥m); let I, be defined by 


[T,,2]=[C+ec’) ep, (1+cc'y"|], 




8.28. 


8.29. 


8.30. 


8.31. 


8.32. 


8.33. 


8.34, 


8.35. 


where the -m columns of D are conditionally independently normally distributed 
with means 0 and covariance matrix [f— C’(1 +€C’)~'C]7!, and C has (margi- 
nal) density proportional to 


Inc 72" *™ 1 - o'r e-ecly Cl 2. 
(a) Show the measures are finite. [Hint: See Problem 8.26.] 


(b) Show that the inequality (26) of Section 5.6 is equivalent to 


[o7ei" wi | 


Jen w,w'] zk, 
[ere ad tad 
Hence the likelihood ratio test is Bayes and thus admissible. 


(Sec. 8.10.1) Admissibility of the likelihood ratio test. Show that the acceptance 
region |ZZ'| /|ZZ’ + XX'| >c satisfies the conditions of Theorem 8.10.1. [Hint: 
The acceptance region can be written []{.,m,>c, where m,=1-A,, i= 


| 


(Sec. 8.10.1) Admissibility of the Lawley-Hotelling test. Show that the accep- 
tance region tr XX'(ZZ’)~! <c satisfies the conditions of Theorem 8.10.1. 


(Sec. 8.10.1) Admissibility of the Bartlett-Nanda- Pillai trace test. Show that the 
acceptance region tr X¥’'(ZZ’ + XX')~'X <c satisfies the conditions of Theorem 
8.10.1. 


(Sec. 8.10.1) Show that if A and B are positive definite and A —B is positive 
semidefinite, then B~'—A~! is positive semidefinite. 


(Sec. 8.10.1) Show that the boundary of A has m-measure 0, [ Hint: Show that 
(closure of A) CA UC, where C ={(WIU— YY’ is singular}.] 


(Sec. 8.10.1) Show that if ACR is convex and monotone in majorization, 
then A* is convex. [ Hint: Show 


(PX +4V)1 ~~ PF, +4QV1, 
where 
Zy = (2yse-e> Am)’ ERZ-] 


(Sec. 8.10.1) Show that C(A) is convex. [ Hint: Follow the solution of Problem 
8.33 to show ( px +qy)<,A if x~,A and y~<,A.] 


(Sec. 8.10.1) Show that if A is monotone, then A* is monotone. [ Hint: Use 
the fact that 


Xe) = Max ax {min(x,,. i) 
hy. ty 


8.36. (Sec. 8.10.2) Monotonicity of the power function of the Bartlett-Nanda-Pillai trace test. Show that

$\operatorname{tr}(uu' + B)(uu' + B + W)^{-1} \le K$

is convex in $u$ for fixed positive semidefinite $B$ and positive definite $B + W$ if $0 \le K \le 1$. [Hint: Verify

$(uu' + B + W)^{-1} = (B + W)^{-1} - \frac{1}{1 + u'(B + W)^{-1}u}\,(B + W)^{-1}uu'(B + W)^{-1}$.

The resulting quadratic form in $u$ involves the matrix $(\operatorname{tr} A)I - A$ for $A = (B + W)^{-1/2}B(B + W)^{-1/2}$; show that this matrix is positive semidefinite by diagonalizing $A$.]

8.37. (Sec. 8.8) Let $x_\alpha^{(\nu)}$, $\alpha = 1,\dots,N_\nu$, be observations from $N(\mu^{(\nu)}, \Sigma)$, $\nu = 1,\dots,q$. What criterion may be used to test the hypothesis that

$\mu^{(\nu)} = \sum_{h=1}^{m} \gamma_h c_{h\nu} + \mu$,

where $c_{h\nu}$ are given numbers and $\gamma_h$, $\mu$ are unknown vectors? [Note: This hypothesis (that the means lie on an $m$-dimensional hyperplane with ratios of distances known) can be put in the form of the general linear hypothesis.]

8.38. (Sec. 8.2) Let $x_\alpha$ be an observation from $N(\beta z_\alpha, \Sigma)$, $\alpha = 1,\dots,N$. Suppose there is a known fixed vector $\gamma$ such that $\beta\gamma = 0$. How do you estimate $\beta$?

8.39. (Sec. 8.8) What is the largest group of transformations on $y_\alpha^{(i)}$, $\alpha = 1,\dots,N_i$, $i = 1,\dots,q$, that leaves (1) invariant? Prove the test (12) is invariant under this group.


CHAPTER 9 


Testing Independence of 
Sets of Variates 


9.1. INTRODUCTION 


In this section we divide a set of p variates with a joint normal distribution 
into g subsets and ask whether the q subsets are mutually independent; this 
is equivalent to testing the hypothesis that each variable in one subset is 
uncorrelated with each variable in the others. We find the likelihood ratio 
criterion for this hypothesis, the moments of the criterion under the null 
hypothesis, some particular distributions, and an asymptotic expansion of the 
distribution. 

The likelihood ratio criterion is invariant under linear transformations 
within sets; another such criterion is developed. Alternative test procedures 
are step-down procedures, which are not invariant, but are flexible. In the 
case of two sets, independence of the two sets is equivalent to the regression 
of one on the other being 0; the criteria for Chapter 8 are available. Some 
optimal properties of the likelihood ratio test are treated.


9.2. THE LIKELIHOOD RATIO CRITERION FOR TESTING 
INDEPENDENCE OF SETS OF VARIATES 


Let the $p$-component vector $X$ be distributed according to $N(\mu, \Sigma)$. We partition $X$ into $q$ subvectors with $p_1, p_2, \dots, p_q$ components, respectively:

An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc, 




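that is,

(1) $X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \\ \vdots \\ X^{(q)} \end{pmatrix}$.

The vector of means $\mu$ and the covariance matrix $\Sigma$ are partitioned similarly,

(2) $\mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \\ \vdots \\ \mu^{(q)} \end{pmatrix}$,

(3) $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \cdots & \Sigma_{1q} \\ \Sigma_{21} & \Sigma_{22} & \cdots & \Sigma_{2q} \\ \vdots & \vdots & & \vdots \\ \Sigma_{q1} & \Sigma_{q2} & \cdots & \Sigma_{qq} \end{pmatrix}$.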


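The null hypothesis we wish to test is that the subvectors $X^{(1)},\dots,X^{(q)}$ are mutually independently distributed, that is, that the density of $X$ factors into the densities of $X^{(1)},\dots,X^{(q)}$. It is

(4) $H: n(x \mid \mu, \Sigma) = \prod_{i=1}^{q} n(x^{(i)} \mid \mu^{(i)}, \Sigma_{ii})$.

If $X^{(1)},\dots,X^{(q)}$ are independent subvectors,

(5) $\mathcal{E}(X^{(i)} - \mu^{(i)})(X^{(j)} - \mu^{(j)})' = \Sigma_{ij} = 0, \qquad i \ne j$.

(See Section 2.4.) Conversely, if (5) holds, then (4) is true. Thus the null hypothesis is equivalently $H: \Sigma_{ij} = 0$, $i \ne j$. This can be stated alternatively as the hypothesis that $\Sigma$ is of the form

(6) $\Sigma_0 = \begin{pmatrix} \Sigma_{11} & 0 & \cdots & 0 \\ 0 & \Sigma_{22} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \Sigma_{qq} \end{pmatrix}$.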


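Given a sample $x_1,\dots,x_N$ of $N$ observations on $X$, the likelihood ratio criterion is

(7) $\lambda = \frac{\max_{\mu,\Sigma_0} L(\mu, \Sigma_0)}{\max_{\mu,\Sigma} L(\mu, \Sigma)}$,

where

(8) $L(\mu, \Sigma) = \prod_{\alpha=1}^{N} \frac{1}{(2\pi)^{\frac{1}{2}p}|\Sigma|^{\frac{1}{2}}}\, e^{-\frac{1}{2}(x_\alpha - \mu)'\Sigma^{-1}(x_\alpha - \mu)}$

and $L(\mu, \Sigma_0)$ is $L(\mu, \Sigma)$ with $\Sigma_{ij} = 0$, $i \ne j$, and where the maximum is taken with respect to all vectors $\mu$ and positive definite $\Sigma$ and $\Sigma_0$ (i.e., $\Sigma_{ii}$). As derived in Section 5.2, Equation (6),

(9) $\max_{\mu,\Sigma} L(\mu, \Sigma) = \frac{1}{(2\pi)^{\frac{1}{2}pN}|\hat{\Sigma}_\Omega|^{\frac{1}{2}N}}\, e^{-\frac{1}{2}pN}$,

where

(10) $\hat{\Sigma}_\Omega = \frac{1}{N}A = \frac{1}{N}\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'$.

Under the null hypothesis,

(11) $L(\mu, \Sigma_0) = \prod_{i=1}^{q} L_i(\mu^{(i)}, \Sigma_{ii})$,

where

(12) $L_i(\mu^{(i)}, \Sigma_{ii}) = \prod_{\alpha=1}^{N} \frac{1}{(2\pi)^{\frac{1}{2}p_i}|\Sigma_{ii}|^{\frac{1}{2}}}\, e^{-\frac{1}{2}(x_\alpha^{(i)} - \mu^{(i)})'\Sigma_{ii}^{-1}(x_\alpha^{(i)} - \mu^{(i)})}$.

Clearly

(13)
$$\begin{aligned}
\max_{\mu,\Sigma_0} L(\mu, \Sigma_0) &= \prod_{i=1}^{q} \max_{\mu^{(i)},\Sigma_{ii}} L_i(\mu^{(i)}, \Sigma_{ii})\\
&= \prod_{i=1}^{q} \frac{1}{(2\pi)^{\frac{1}{2}p_iN}|\hat{\Sigma}_{ii\omega}|^{\frac{1}{2}N}}\, e^{-\frac{1}{2}p_iN}\\
&= \frac{1}{(2\pi)^{\frac{1}{2}pN}\prod_{i=1}^{q}|\hat{\Sigma}_{ii\omega}|^{\frac{1}{2}N}}\, e^{-\frac{1}{2}pN},
\end{aligned}$$

where

(14) $\hat{\Sigma}_{ii\omega} = \frac{1}{N}\sum_{\alpha=1}^{N} (x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(i)} - \bar{x}^{(i)})'$.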

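If we partition $A$ and $\hat{\Sigma}_\Omega$ as we have $\Sigma$,

(15) $A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1q} \\ A_{21} & A_{22} & \cdots & A_{2q} \\ \vdots & \vdots & & \vdots \\ A_{q1} & A_{q2} & \cdots & A_{qq} \end{pmatrix}, \qquad \hat{\Sigma}_\Omega = \begin{pmatrix} \hat{\Sigma}_{11} & \hat{\Sigma}_{12} & \cdots & \hat{\Sigma}_{1q} \\ \hat{\Sigma}_{21} & \hat{\Sigma}_{22} & \cdots & \hat{\Sigma}_{2q} \\ \vdots & \vdots & & \vdots \\ \hat{\Sigma}_{q1} & \hat{\Sigma}_{q2} & \cdots & \hat{\Sigma}_{qq} \end{pmatrix}$,

we see that $\hat{\Sigma}_{ii\omega} = \hat{\Sigma}_{ii} = (1/N)A_{ii}$.

The likelihood ratio criterion is

(16) $\lambda = \frac{\max_{\mu,\Sigma_0} L(\mu, \Sigma_0)}{\max_{\mu,\Sigma} L(\mu, \Sigma)} = \frac{|\hat{\Sigma}_\Omega|^{\frac{1}{2}N}}{\prod_{i=1}^{q}|\hat{\Sigma}_{ii}|^{\frac{1}{2}N}} = \frac{|A|^{\frac{1}{2}N}}{\prod_{i=1}^{q}|A_{ii}|^{\frac{1}{2}N}}$.

The critical region of the likelihood ratio test is

(17) $\lambda \le \lambda(\varepsilon)$,

where $\lambda(\varepsilon)$ is a number such that the probability of (17) is $\varepsilon$ with $\Sigma = \Sigma_0$. (It remains to show that such a number can be found.) Let

(18) $V = \frac{|A|}{\prod_{i=1}^{q}|A_{ii}|}$.

Then $\lambda = V^{\frac{1}{2}N}$ is a monotonic increasing function of $V$. The critical region (17) can be equivalently written as

(19) $V \le V(\varepsilon)$.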


Theorem 9.2.1. Let $x_1,\dots,x_N$ be a sample of $N$ observations drawn from
$N(\mu,\Sigma)$, where $x_\alpha$, $\mu$, and $\Sigma$ are partitioned into $p_1,\dots,p_q$ rows (and columns
in the case of $\Sigma$) as indicated in (1), (2), and (3). The likelihood ratio criterion
that the $q$ sets of components are mutually independent is given by (16), where $A$
is defined by (10) and partitioned according to (15). The likelihood ratio test is
given by (17) and equivalently by (19), where $V$ is defined by (18) and $\lambda(\varepsilon)$ or
$V(\varepsilon)$ is chosen to obtain the significance level $\varepsilon$.


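Since $r_{ij}=a_{ij}/\sqrt{a_{ii}a_{jj}}$, we have

$$|A|=|R|\prod_{i=1}^{p}a_{ii}, \tag{20}$$

where

$$R=(r_{ij})=\begin{pmatrix}R_{11}&R_{12}&\cdots&R_{1q}\\ R_{21}&R_{22}&\cdots&R_{2q}\\ \vdots&\vdots&&\vdots\\ R_{q1}&R_{q2}&\cdots&R_{qq}\end{pmatrix} \tag{21}$$

and

$$|A_{ii}|=|R_{ii}|\prod_{j=p_1+\cdots+p_{i-1}+1}^{p_1+\cdots+p_i}a_{jj}. \tag{22}$$

Thus

$$V=\frac{|A|}{\prod_{i=1}^{q}|A_{ii}|}=\frac{|R|}{\prod_{i=1}^{q}|R_{ii}|}. \tag{23}$$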


That is, V can be expressed entirely in terms of sample correlation coeffi- 
cients. 
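
The equality of the two forms of $V$, one in terms of $A$ and one in terms of $R$, can be checked numerically. The following is a minimal sketch, not part of the original text: the matrix `A`, the partition `[1, 2]`, and the helper names `det` and `criterion_v` are all hypothetical choices made for illustration.

```python
def det(m):
    """Determinant by cofactor expansion along the first row (fine for small p)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))


def criterion_v(a, sizes):
    """V = |A| / prod_i |A_ii|, with diagonal blocks of orders sizes = (p_1, ..., p_q)."""
    v, start = det(a), 0
    for p_i in sizes:
        stop = start + p_i
        v /= det([row[start:stop] for row in a[start:stop]])
        start = stop
    return v


# Hypothetical 3 x 3 matrix of sums of squares and cross products; p_1 = 1, p_2 = 2.
A = [[4.0, 2.0, 1.0],
     [2.0, 9.0, 3.0],
     [1.0, 3.0, 16.0]]
sd = [A[i][i] ** 0.5 for i in range(3)]
R = [[A[i][j] / (sd[i] * sd[j]) for j in range(3)] for i in range(3)]

v_from_a = criterion_v(A, [1, 2])   # |A| = 479, |A_11| = 4, |A_22| = 135
v_from_r = criterion_v(R, [1, 2])   # same ratio computed from correlations
```

Here both calls give $479/540$ up to floating-point rounding, since the factors $a_{ii}$ cancel between numerator and denominator.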

We can interpret the criterion $V$ in terms of generalized variance. Each
set $(x_{i1},\dots,x_{iN})$ can be considered as a vector in $N$-space; the set
$(x_{i1}-\bar x_i,\dots,x_{iN}-\bar x_i)=z_i$, say, is the projection on the plane orthogonal to the
equiangular line. The determinant $|A|$ is the $p$-dimensional volume squared
of the parallelotope with $z_1,\dots,z_p$ as principal edges. The determinant $|A_{ii}|$
is the $p_i$-dimensional volume squared of the parallelotope having as principal
edges the $i$th set of vectors. If each set of vectors is orthogonal to each other
set (i.e., $R_{ij}=0$, $i\ne j$), then the volume squared $|A|$ is the product of the
volumes squared $|A_{ii}|$. For example, if $p=2$, $p_1=p_2=1$, this statement is
that the area of a parallelogram is the product of the lengths of the sides
if the sides are at right angles. If the sets are almost orthogonal, then $|A|$
is almost $\prod|A_{ii}|$, and $V$ is almost 1.

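The criterion has an invariance property. Let $C_i$ be an arbitrary nonsingular
matrix of order $p_i$ and let

$$C=\begin{pmatrix}C_1&0&\cdots&0\\ 0&C_2&\cdots&0\\ \vdots&\vdots&&\vdots\\ 0&0&\cdots&C_q\end{pmatrix}. \tag{24}$$

Let $Cx_\alpha+d=x_\alpha^*$. Then the criterion for independence in terms of $x_\alpha^*$ is
identical to the criterion in terms of $x_\alpha$. Let $A^*=\sum_\alpha(x_\alpha^*-\bar x^*)(x_\alpha^*-\bar x^*)'$ be
partitioned into submatrices $A_{ij}^*$. Then

$$A_{ij}^*=\sum_\alpha\bigl(x_\alpha^{*(i)}-\bar x^{*(i)}\bigr)\bigl(x_\alpha^{*(j)}-\bar x^{*(j)}\bigr)'
=C_i\sum_\alpha\bigl(x_\alpha^{(i)}-\bar x^{(i)}\bigr)\bigl(x_\alpha^{(j)}-\bar x^{(j)}\bigr)'C_j'
=C_iA_{ij}C_j' \tag{25}$$

and $A^*=CAC'$. Thus

$$V^*=\frac{|A^*|}{\prod|A_{ii}^*|}=\frac{|CAC'|}{\prod|C_iA_{ii}C_i'|}
=\frac{|C|\cdot|A|\cdot|C'|}{\prod|C_i|\cdot|A_{ii}|\cdot|C_i'|}=\frac{|A|}{\prod|A_{ii}|}=V \tag{26}$$

for $|C|=\prod|C_i|$. Thus the test is invariant with respect to linear transformations
within each set.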
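
As an illustration of this invariance, the sketch below transforms a small matrix by a block-diagonal $C$ and recomputes the criterion in exact rational arithmetic. The matrices `A` and `C`, the partition `[1, 2]`, and the helper names are hypothetical; they are not taken from the text.

```python
from fractions import Fraction


def det(m):
    """Determinant by cofactor expansion along the first row (fine for small p)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))


def criterion_v(a, sizes):
    """V = |A| / prod_i |A_ii| as an exact fraction."""
    v, start = Fraction(det(a)), 0
    for p_i in sizes:
        stop = start + p_i
        v /= det([row[start:stop] for row in a[start:stop]])
        start = stop
    return v


def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(u * w for u, w in zip(row, col)) for col in zip(*y)] for row in x]


A = [[4, 2, 1], [2, 9, 3], [1, 3, 16]]
# Block diagonal with C_1 = (2) and C_2 = [[1, 1], [0, 3]], both nonsingular.
C = [[2, 0, 0], [0, 1, 1], [0, 0, 3]]
C_t = [list(col) for col in zip(*C)]
A_star = matmul(matmul(C, A), C_t)          # A* = C A C'

v = criterion_v(A, [1, 2])
v_star = criterion_v(A_star, [1, 2])
```

The two values agree exactly because the factor $|C|^2=\prod|C_i|^2$ appears in both the numerator and the denominator of $V^*$.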

Narain (1950) showed that the test based on V is strictly unbiased; that is, 
the probability of rejecting the null hypothesis is greater than the significance 
level if the hypothesis is not true. [See also Daly (1940).] 


9.3. THE DISTRIBUTION OF THE LIKELIHOOD RATIO CRITERION
WHEN THE NULL HYPOTHESIS IS TRUE


9.3.1. Characterization of the Distribution


We shall show that under the null hypothesis the distribution of the criterion
$V$ is the distribution of a product of independent variables, each of which has
the distribution of a criterion $U$ for the linear hypothesis (Section 8.4).

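Let

$$V_i=\frac{\begin{vmatrix}A_{11}&\cdots&A_{1,i-1}&A_{1i}\\ \vdots&&\vdots&\vdots\\ A_{i-1,1}&\cdots&A_{i-1,i-1}&A_{i-1,i}\\ A_{i1}&\cdots&A_{i,i-1}&A_{ii}\end{vmatrix}}
{\begin{vmatrix}A_{11}&\cdots&A_{1,i-1}\\ \vdots&&\vdots\\ A_{i-1,1}&\cdots&A_{i-1,i-1}\end{vmatrix}\cdot|A_{ii}|},\qquad i=2,\dots,q. \tag{1}$$

Then $V=V_2V_3\cdots V_q$. Note that $V_i$ is the $N/2$th root of the likelihood ratio
criterion for testing the null hypothesis

$$H_i:\ \Sigma_{i1}=0,\ \dots,\ \Sigma_{i,i-1}=0, \tag{2}$$

that is, that $X^{(i)}$ is independent of $(X^{(1)\prime},\dots,X^{(i-1)\prime})'$. The null hypothesis
$H$ is the intersection of these hypotheses.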
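
The factorization of $V$ into the step criteria $V_2,\dots,V_q$ telescopes, because each leading determinant appears once in a numerator and once in a denominator. A minimal sketch of this, assuming a hypothetical $3\times3$ matrix `A` with $p_1=p_2=p_3=1$ (the helper names are illustrative, not from the text):

```python
from fractions import Fraction
from math import prod


def det(m):
    """Determinant by cofactor expansion along the first row (fine for small p)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))


def leading(a, k):
    """Upper left-hand k x k corner of a."""
    return [row[:k] for row in a[:k]]


def v_factors(a, sizes):
    """Return [V_2, ..., V_q]: leading determinant through set i, divided by the
    leading determinant through set i - 1 times |A_ii|."""
    factors, start = [], sizes[0]
    for p_i in sizes[1:]:
        stop = start + p_i
        a_ii = [row[start:stop] for row in a[start:stop]]
        factors.append(Fraction(det(leading(a, stop)),
                                det(leading(a, start)) * det(a_ii)))
        start = stop
    return factors


A = [[4, 2, 1], [2, 9, 3], [1, 3, 16]]
factors = v_factors(A, [1, 1, 1])            # V_2 = 32/36, V_3 = 479/512
v_direct = Fraction(det(A), 4 * 9 * 16)      # |A| / (a_11 a_22 a_33)
```

The product of the entries of `factors` reproduces `v_direct`.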


Theorem 9.3.1. When $H_i$ is true, $V_i$ has the distribution of $U_{p_i,\bar p_i,n-\bar p_i}$, where
$n=N-1$ and $\bar p_i=p_1+\cdots+p_{i-1}$, $i=2,\dots,q$.


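Proof. The matrix $A$ has the distribution of $\sum_{\alpha=1}^{n}Z_\alpha Z_\alpha'$, where $Z_1,\dots,Z_n$
are independently distributed according to $N(0,\Sigma)$ and $Z_\alpha$ is partitioned as
$(Z_\alpha^{(1)\prime},\dots,Z_\alpha^{(q)\prime})'$. Then conditional on $Z_\alpha^{(1)}=z_\alpha^{(1)},\dots,Z_\alpha^{(i-1)}=z_\alpha^{(i-1)}$,
$\alpha=1,\dots,n$, the subvectors $Z_1^{(i)},\dots,Z_n^{(i)}$ are independently distributed, $Z_\alpha^{(i)}$
having a normal distribution with mean

$$\beta_i\begin{pmatrix}z_\alpha^{(1)}\\ \vdots\\ z_\alpha^{(i-1)}\end{pmatrix} \tag{3}$$

and covariance matrix

$$\Sigma_{ii}-\beta_i\begin{pmatrix}\Sigma_{11}&\cdots&\Sigma_{1,i-1}\\ \vdots&&\vdots\\ \Sigma_{i-1,1}&\cdots&\Sigma_{i-1,i-1}\end{pmatrix}\beta_i', \tag{4}$$

where

$$\beta_i=(\Sigma_{i1}\ \cdots\ \Sigma_{i,i-1})\begin{pmatrix}\Sigma_{11}&\cdots&\Sigma_{1,i-1}\\ \vdots&&\vdots\\ \Sigma_{i-1,1}&\cdots&\Sigma_{i-1,i-1}\end{pmatrix}^{-1}. \tag{5}$$

When the null hypothesis is not assumed, the estimator of $\beta_i$ is (5) with $\Sigma_{jk}$
replaced by $A_{jk}$, and the estimator of (4) is (4) with $\Sigma_{jk}$ replaced by $(1/n)A_{jk}$
and $\beta_i$ replaced by its estimator. Under $H_i$: $\beta_i=0$ and the covariance matrix
(4) is $\Sigma_{ii}$, which is estimated by $(1/n)A_{ii}$. The $N/2$th root of the likelihood
ratio criterion for $H_i$ is

$$\frac{\left|A_{ii}-(A_{i1}\ \cdots\ A_{i,i-1})\begin{pmatrix}A_{11}&\cdots&A_{1,i-1}\\ \vdots&&\vdots\\ A_{i-1,1}&\cdots&A_{i-1,i-1}\end{pmatrix}^{-1}\begin{pmatrix}A_{1i}\\ \vdots\\ A_{i-1,i}\end{pmatrix}\right|}{|A_{ii}|}
=\frac{\begin{vmatrix}A_{11}&\cdots&A_{1,i-1}&A_{1i}\\ \vdots&&\vdots&\vdots\\ A_{i-1,1}&\cdots&A_{i-1,i-1}&A_{i-1,i}\\ A_{i1}&\cdots&A_{i,i-1}&A_{ii}\end{vmatrix}}
{\begin{vmatrix}A_{11}&\cdots&A_{1,i-1}\\ \vdots&&\vdots\\ A_{i-1,1}&\cdots&A_{i-1,i-1}\end{vmatrix}\cdot|A_{ii}|}, \tag{6}$$

which is $V_i$. This is the $U$-statistic for $p_i$ dimensions, $\bar p_i$ components of the
conditioning vector, and $n-\bar p_i$ degrees of freedom in the estimator of the
covariance matrix. ∎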


Theorem 9.3.2. The distribution of $V$ under the null hypothesis is the
distribution of $V_2V_3\cdots V_q$, where $V_2,\dots,V_q$ are independently distributed with $V_i$
having the distribution of $U_{p_i,\bar p_i,n-\bar p_i}$, where $\bar p_i=p_1+\cdots+p_{i-1}$.


Proof. From the proof of Theorem 9.3.1, we see that the distribution of $V_i$
is that of $U_{p_i,\bar p_i,n-\bar p_i}$ not depending on the conditioning $z_\alpha^{(k)}$, $k=1,\dots,i-1$,
$\alpha=1,\dots,n$. Hence the distribution of $V_i$ does not depend on $V_2,\dots,V_{i-1}$. ∎


Theorem 9.3.3. Under the null hypothesis $V$ is distributed as $\prod_{i=2}^{q}\prod_{j=1}^{p_i}X_{ij}$,
where the $X_{ij}$'s are independent and $X_{ij}$ has the density
$\beta[x\mid\tfrac12(n-\bar p_i+1-j),\tfrac12\bar p_i]$.


Proof. This theorem follows from Theorems 9.3.2 and 8.4.1. ∎


9.3.2. Moments


Theorem 9.3.4. When the null hypothesis is true, the $h$th moment of the
criterion is

$$\mathcal{E}V^h=\prod_{i=2}^{q}\prod_{j=1}^{p_i}\frac{\Gamma[\tfrac12(n-\bar p_i+1-j)+h]\,\Gamma[\tfrac12(n+1-j)]}{\Gamma[\tfrac12(n-\bar p_i+1-j)]\,\Gamma[\tfrac12(n+1-j)+h]}. \tag{7}$$


Proof. Because $V_2,\dots,V_q$ are independent,

$$\mathcal{E}V^h=\mathcal{E}V_2^h\,\mathcal{E}V_3^h\cdots\mathcal{E}V_q^h. \tag{8}$$

Theorem 9.3.2 implies $\mathcal{E}V_i^h=\mathcal{E}U_{p_i,\bar p_i,n-\bar p_i}^h$. Then the theorem follows by
substituting from Theorem 8.4.3. ∎
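
The moment formula of Theorem 9.3.4 is a double product of gamma-function ratios, $\Gamma[\tfrac12(n-\bar p_i+1-j)+h]\,\Gamma[\tfrac12(n+1-j)]$ over $\Gamma[\tfrac12(n-\bar p_i+1-j)]\,\Gamma[\tfrac12(n+1-j)+h]$, taken over $i=2,\dots,q$ and $j=1,\dots,p_i$. It can be evaluated stably through log-gamma functions. The sketch below assumes that form; the function name `moment_v` and the argument layout are illustrative choices, not from the text.

```python
from math import exp, lgamma


def moment_v(h, n, sizes):
    """h-th moment of V under the null hypothesis.

    n = N - 1 and sizes = (p_1, ..., p_q); requires n - p_bar_i + 1 - j > 0
    for every factor, where p_bar_i = p_1 + ... + p_{i-1}.
    """
    log_moment, p_bar = 0.0, sizes[0]
    for p_i in sizes[1:]:
        for j in range(1, p_i + 1):
            a = 0.5 * (n - p_bar + 1 - j)
            b = 0.5 * (n + 1 - j)
            log_moment += lgamma(a + h) + lgamma(b) - lgamma(a) - lgamma(b + h)
        p_bar += p_i
    return exp(log_moment)


# Two scalar sets (p_1 = p_2 = 1): V = 1 - r^2 has a beta density with
# parameters (n - 1)/2 and 1/2, so its mean is (n - 1)/n.
mean_v = moment_v(1, 10, [1, 1])
```

With $n=10$ the first moment is $9/10$, and any configuration returns 1 for $h=0$.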


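If the $p_i$ are even, say $p_i=2r_i$, $i>1$, then by using the duplication formula
$\Gamma(\alpha+\tfrac12)\Gamma(\alpha+1)=\sqrt{\pi}\,\Gamma(2\alpha+1)2^{-2\alpha}$ for the gamma function we can reduce
the $h$th moment of $V$ to

$$\mathcal{E}V^h=\prod_{i=2}^{q}\prod_{k=1}^{r_i}\frac{\Gamma(n+1-\bar p_i-2k+2h)\,\Gamma(n+1-2k)}{\Gamma(n+1-\bar p_i-2k)\,\Gamma(n+1-2k+2h)}
=\prod_{i=2}^{q}\prod_{k=1}^{r_i}\left\{B^{-1}(n+1-\bar p_i-2k,\bar p_i)\int_0^1x^{n+1-\bar p_i-2k+2h-1}(1-x)^{\bar p_i-1}\,dx\right\}. \tag{9}$$

Thus $V$ is distributed as $\prod_{i=2}^{q}\bigl\{\prod_{k=1}^{r_i}Y_{ik}^2\bigr\}$, where the $Y_{ik}$ are independent, and
$Y_{ik}$ has density $\beta(y;\,n+1-\bar p_i-2k,\,\bar p_i)$.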

In general, the duplication formula for the gamma function can be used to 
reduce the moments as indicated in Section 8.4. 


9.3.3. Some Special Distributions 


If $q=2$, then $V$ is distributed as $U_{p_2,p_1,n-p_1}$. Special cases have been treated
in Section 8.4, and references to the literature given. The distribution for
$p_1=p_2=p_3=1$ is given in Problem 9.2, and for $p_1=p_2=p_3=2$ in Problem
9.3. Wilks (1935) gave the distributions for $p_1=p_2=1$, for $p_3=p-2$,† for
$p_1=1$, $p_2=p_3=2$, for $p_1=1$, $p_2=2$, $p_3=3$, for $p_1=1$, $p_2=2$, $p_3=4$, and
for $p_1=p_2=2$, $p_3=3$. Consul (1967a) treated the case $p_1=2$, $p_2=3$, $p_3$
even.

Wald and Brookner (1941) gave a method for deriving the distribution if
not more than one $p_i$ is odd. It can be seen that the same result can be
obtained by integration of products of beta functions after using the duplication
formula to reduce the moments.

Mathai and Saxena (1973) gave the exact distribution for the general case.
Mathai and Katiyar (1979) gave exact significance points for $p=3(1)10$ and
$n=3(1)20$ for significance levels of 5% and 1% (of $-k\log V$ of Section 9.4).


†In Wilks's formula $\Gamma[\tfrac12(N-2-i)]$ should be $\Gamma[\tfrac12(n-2-i)]$.




9.4. AN ASYMPTOTIC EXPANSION OF THE DISTRIBUTION OF THE 
LIKELIHOOD RATIO CRITERION 


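The $h$th moment of $\lambda=V^{\frac12 N}$ is

$$\mathcal{E}\lambda^h=K\,\frac{\prod_{i=1}^{p}\Gamma\{\tfrac12[N(1+h)-i]\}}{\prod_{i=1}^{q}\bigl\{\prod_{j=1}^{p_i}\Gamma\{\tfrac12[N(1+h)-j]\}\bigr\}}, \tag{1}$$

where $K$ is chosen so that $\mathcal{E}\lambda^0=1$. This is of the form of (1) of Section 8.5
with

$$a=p,\quad b=p,\quad x_k=\tfrac12N,\quad \xi_k=\tfrac12(-k),\quad k=1,\dots,p,$$
$$y_j=\tfrac12N,\quad \eta_j=\tfrac12(-j+p_1+\cdots+p_{i-1}),\quad j=p_1+\cdots+p_{i-1}+1,\dots,p_1+\cdots+p_i,\quad i=1,\dots,q. \tag{2}$$

Then $f=\tfrac12[p(p+1)-\sum p_i(p_i+1)]=\tfrac12(p^2-\sum p_i^2)$, $\beta_k=\varepsilon_j=\tfrac12(1-\rho)N$.
In order to make the second term in the expansion vanish we take $\rho$ as

$$\rho=1-\frac{2(p^3-\sum p_i^3)+9(p^2-\sum p_i^2)}{6N(p^2-\sum p_i^2)}. \tag{3}$$

Let

$$k=\rho N=N-\frac32-\frac{p^3-\sum p_i^3}{3(p^2-\sum p_i^2)}. \tag{4}$$

Then $\omega_2=\gamma_2/k^2$, where [as shown by Box (1949)]

$$\gamma_2=\frac{p^4-\sum p_i^4}{48}-\frac{5(p^2-\sum p_i^2)}{96}-\frac{(p^3-\sum p_i^3)^2}{72(p^2-\sum p_i^2)}. \tag{5}$$

We obtain from Section 8.5 the following expansion:

$$\Pr\{-k\log V\le v\}=\Pr\{\chi_f^2\le v\}+\frac{\gamma_2}{k^2}\bigl[\Pr\{\chi_{f+4}^2\le v\}-\Pr\{\chi_f^2\le v\}\bigr]+O(k^{-3}). \tag{6}$$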




Table 9.1

 p    f      v       γ₂      N     k     γ₂/k²    Second Term
 4    6    12.592   11/24    15   71/6   0.0033    -0.0007
 5   10    18.307   15/8     15   23/2   0.0142    -0.0021
 6   15    24.996   235/48   15   67/6   0.0393    -0.0043
                             16   73/6   0.0331    -0.0036


If $q=2$, we obtain further terms in the expansion by using the results of
Section 8.5.
If $p_i=1$, we have

$$f=\tfrac12p(p-1),\qquad k=N-\frac{2p+11}{6},$$
$$\gamma_2=\frac{p(p-1)}{288}(2p^2-2p-13),\qquad \gamma_3=\frac{p(p-1)}{3240}(p-2)(2p-1)(p+1); \tag{7}$$

other terms are given by Box (1949). If $p_i=2$ ($p=2q$),

$$f=2q(q-1),\qquad k=N-\frac{4q+13}{6},\qquad \gamma_2=\frac{q(q-1)}{72}(8q^2-8q-7). \tag{8}$$

Table 9.1 gives an indication of the order of approximation of (6) for
$p_i=1$. In each case $v$ is chosen so that the first term is 0.95.
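
The constants entering the expansion can be computed exactly for any partition. The sketch below assumes the general forms $f=\tfrac12(p^2-\sum p_i^2)$, $k=N-\tfrac32-(p^3-\sum p_i^3)/[3(p^2-\sum p_i^2)]$, and $\gamma_2=(p^4-\sum p_i^4)/48-5(p^2-\sum p_i^2)/96-(p^3-\sum p_i^3)^2/[72(p^2-\sum p_i^2)]$; the function name `box_constants` is a hypothetical helper, not from the text.

```python
from fractions import Fraction


def box_constants(N, sizes):
    """Return (f, k, gamma_2) for sample size N and set sizes (p_1, ..., p_q)."""
    p = sum(sizes)
    d2 = p ** 2 - sum(s ** 2 for s in sizes)
    d3 = p ** 3 - sum(s ** 3 for s in sizes)
    d4 = p ** 4 - sum(s ** 4 for s in sizes)
    f = Fraction(d2, 2)
    k = N - Fraction(3, 2) - Fraction(d3, 3 * d2)
    gamma2 = Fraction(d4, 48) - Fraction(5 * d2, 96) - Fraction(d3 ** 2, 72 * d2)
    return f, k, gamma2


# p = 4 scalar variates and N = 15, as in the first row of Table 9.1.
f, k, gamma2 = box_constants(15, [1, 1, 1, 1])
ratio = float(gamma2 / k ** 2)
```

For this case the general forms reduce to $f=\tfrac12p(p-1)$, $k=N-(2p+11)/6$, and $\gamma_2=p(p-1)(2p^2-2p-13)/288$, and `ratio` rounds to the tabled 0.0033.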

If $q=2$, the approximate distributions given in Sections 8.5.3 and 8.5.4 are
available. [See also Nagao (1973c).]


9.5. OTHER CRITERIA 


In case $q=2$, the criteria considered in Section 8.6 can be used with $G+H$
replaced by $A_{11}$ and $H$ replaced by $A_{12}A_{22}^{-1}A_{21}$, or $G+H$ replaced by $A_{22}$
and $H$ replaced by $A_{21}A_{11}^{-1}A_{12}$.

The null hypothesis of independence is that $\Sigma-\Sigma_0=0$, where $\Sigma_0$ is
defined in (6) of Section 9.2. An appropriate test procedure will reject the
null hypothesis if the elements of $A-A_0$ are large compared to the elements
of the diagonal blocks of $A_0$ (where $A_0$ is composed of diagonal blocks $A_{ii}$
and off-diagonal blocks of 0). Let the nonsingular matrix $B_{ii}$ be such that
$B_{ii}A_{ii}B_{ii}'=I$, that is, $A_{ii}^{-1}=B_{ii}'B_{ii}$, and let $B_0$ be the matrix with $B_{ii}$ as the
$i$th diagonal block and 0's as off-diagonal blocks. Then $B_0A_0B_0'=I$ and

$$B_0(A-A_0)B_0'=\begin{pmatrix}0&B_{11}A_{12}B_{22}'&\cdots&B_{11}A_{1q}B_{qq}'\\ B_{22}A_{21}B_{11}'&0&\cdots&B_{22}A_{2q}B_{qq}'\\ \vdots&\vdots&&\vdots\\ B_{qq}A_{q1}B_{11}'&B_{qq}A_{q2}B_{22}'&\cdots&0\end{pmatrix}. \tag{1}$$

This matrix is invariant with respect to transformations (24) of Section 9.2
operating on $A$. A different choice of $B_{ii}$ amounts to multiplying (1) on the
left by $Q_0$ and on the right by $Q_0'$, where $Q_0$ is a matrix with orthogonal
diagonal blocks and off-diagonal blocks of 0's. A test procedure should reject
the null hypothesis if some measure of the numerical values of the elements
of (1) is too large. The likelihood ratio criterion is the $N/2$ power of
$|B_0(A-A_0)B_0'+I|=|B_0AB_0'|$.
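Another measure, suggested by Nagao (1973a), is

$$\tfrac12\operatorname{tr}[B_0(A-A_0)B_0']^2=\tfrac12\operatorname{tr}[(A-A_0)A_0^{-1}]^2=\tfrac12\operatorname{tr}(AA_0^{-1}-I)^2
=\tfrac12\sum_{\substack{i,j=1\\ i\ne j}}^{q}\operatorname{tr}A_{ij}A_{jj}^{-1}A_{ji}A_{ii}^{-1}. \tag{2}$$

For $q=2$ this measure is the average of the Bartlett-Nanda-Pillai trace
criterion with $G+H$ replaced by $A_{11}$ and $H$ replaced by $A_{12}A_{22}^{-1}A_{21}$ and the
same criterion with $G+H$ replaced by $A_{22}$ and $H$ replaced by $A_{21}A_{11}^{-1}A_{12}$.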
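
When every set consists of a single variate ($p_i=1$), $A_0$ is the diagonal of $A$ and the measure $\tfrac12\operatorname{tr}(AA_0^{-1}-I)^2$ reduces to the sum of squared sample correlations $\sum_{i<j}r_{ij}^2$. A short sketch of that special case, using a hypothetical matrix `A` that is not taken from the text:

```python
from fractions import Fraction

A = [[4, 2, 1], [2, 9, 3], [1, 3, 16]]
p = len(A)

# E = A A_0^{-1} - I with A_0 = diag(a_11, ..., a_pp):
# zero diagonal and off-diagonal elements e_ij = a_ij / a_jj.
E = [[Fraction(0) if i == j else Fraction(A[i][j], A[j][j]) for j in range(p)]
     for i in range(p)]

# (1/2) tr E^2, using tr E^2 = sum_i sum_j e_ij e_ji.
half_trace = Fraction(1, 2) * sum(E[i][j] * E[j][i] for i in range(p) for j in range(p))

# Sum of squared correlations r_ij^2 = a_ij^2 / (a_ii a_jj) over i < j.
sum_r2 = sum(Fraction(A[i][j] ** 2, A[i][i] * A[j][j])
             for i in range(p) for j in range(i + 1, p))
```

Both quantities equal $1/9+1/64+1/16=109/576$ for this matrix.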

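This criterion multiplied by $n$ or $N$ has a limiting $\chi^2$-distribution with
number of degrees of freedom $f=\tfrac12(p^2-\sum_{i=1}^{q}p_i^2)$, which is the same
number as for $-N\log V$. Nagao obtained an asymptotic expansion of the
distribution:

$$\begin{aligned}
\Pr\{\tfrac12n\operatorname{tr}(AA_0^{-1}-I)^2\le x\}
&=\Pr\{\chi_f^2\le x\}\\
&\quad+\frac1n\Bigl[\frac1{12}\Bigl(p^3-3p\sum_{i=1}^{q}p_i^2+2\sum_{i=1}^{q}p_i^3\Bigr)\Pr\{\chi_{f+6}^2\le x\}\\
&\qquad+\frac18\Bigl(-2p^3+4p\sum_{i=1}^{q}p_i^2-2\sum_{i=1}^{q}p_i^3-p^2+\sum_{i=1}^{q}p_i^2\Bigr)\Pr\{\chi_{f+4}^2\le x\}\\
&\qquad+\frac14\Bigl(p^3-p\sum_{i=1}^{q}p_i^2+p^2-\sum_{i=1}^{q}p_i^2\Bigr)\Pr\{\chi_{f+2}^2\le x\}\\
&\qquad-\frac1{24}\Bigl(2p^3-2\sum_{i=1}^{q}p_i^3+3p^2-3\sum_{i=1}^{q}p_i^2\Bigr)\Pr\{\chi_f^2\le x\}\Bigr]+O(n^{-2}).
\end{aligned} \tag{3}$$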


9.6. STEP-DOWN PROCEDURES 


9.6.1. Step-down by Blocks 


It was shown in Section 9.3 that the $N/2$th root of the likelihood ratio
criterion, namely $V$, is the product of $q-1$ of these criteria, that is,
$V_2,\dots,V_q$. The $i$th subcriterion $V_i$ provides a likelihood ratio test of the
hypothesis $H_i$ [(2) of Section 9.3] that the $i$th subvector is independent of the
preceding $i-1$ subvectors. Under the null hypothesis $H$ [$=\bigcap_{i=2}^{q}H_i$], these
$q-1$ criteria are independent (Theorem 9.3.2). A step-down testing procedure
is to accept the null hypothesis if

$$V_i\ge v_i(\varepsilon_i),\qquad i=2,\dots,q, \tag{1}$$

and reject the null hypothesis if $V_i<v_i(\varepsilon_i)$ for any $i$. Here $v_i(\varepsilon_i)$ is the
number such that the probability of (1) when $H_i$ is true is $1-\varepsilon_i$. The
significance level of the procedure is $\varepsilon$ satisfying

$$1-\varepsilon=\prod_{i=2}^{q}(1-\varepsilon_i). \tag{2}$$

The subtests can be done sequentially, say, in the order $2,\dots,q$. As soon as a
subtest calls for rejection, the procedure is terminated; if no subtest leads to
rejection, $H$ is accepted. The ordering of the subvectors is at the discretion
of the investigator as well as the ordering of the tests.
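
Because the subtests are independent under $H$, the overall level follows from the individual levels by $1-\varepsilon=\prod_{i=2}^{q}(1-\varepsilon_i)$. A small sketch of that bookkeeping follows; the function names and the equal-allocation rule are illustrative assumptions, since the text leaves the choice of the $\varepsilon_i$ to the investigator.

```python
from math import prod


def overall_level(levels):
    """Overall significance level from independent subtest levels eps_2, ..., eps_q."""
    return 1 - prod(1 - e for e in levels)


def equal_levels(eps, m):
    """One possible allocation: a common level for each of m subtests giving overall eps."""
    return 1 - (1 - eps) ** (1 / m)


combined = overall_level([0.01, 0.02])   # 1 - 0.99 * 0.98
per_test = equal_levels(0.05, 3)         # common level for q - 1 = 3 subtests
```

With levels 0.01 and 0.02 the overall level is 0.0298, slightly less than their sum.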

Suppose, for example, that measurements on an individual are grouped 
into physiological measurements, measurements of intelligence, and mea- 
surements of emotional characteristics. One could test that intelligence is 
independent of physiology and then that emotions are independent of 
physiology and intelligence, or the order of these could be reversed. Alterna- 
tively, one could test that intelligence is independent of emotions and then 
that physiology is independent of these two aspects, or the order reversed. 
There is a third pair of procedures.




Other criteria for the linear hypothesis discussed in Section 8.6 can be
used to test the component hypotheses $H_2,\dots,H_q$ in a similar fashion.
When $H_i$ is true, the criterion is distributed independently of $X_\alpha^{(1)},\dots,X_\alpha^{(i-1)}$,
$\alpha=1,\dots,N$, and hence independently of the criteria for $H_2,\dots,H_{i-1}$.


9.6.2. Step-down by Components 


In Section 8.4.5 we discussed a componentwise step-down procedure for
testing that a submatrix of regression coefficients was a specified matrix. We
adapt this procedure to test the null hypothesis $H_i$ cast in the form

$$H_i:\ (\Sigma_{i1}\ \Sigma_{i2}\ \cdots\ \Sigma_{i,i-1})\begin{pmatrix}\Sigma_{11}&\Sigma_{12}&\cdots&\Sigma_{1,i-1}\\ \Sigma_{21}&\Sigma_{22}&\cdots&\Sigma_{2,i-1}\\ \vdots&\vdots&&\vdots\\ \Sigma_{i-1,1}&\Sigma_{i-1,2}&\cdots&\Sigma_{i-1,i-1}\end{pmatrix}^{-1}=0, \tag{3}$$

where 0 is of order $p_i\times\bar p_i$. The matrix in (3) consists of the coefficients of the
regression of $X^{(i)}$ on $(X^{(1)\prime},\dots,X^{(i-1)\prime})'$.

For $i=2$, we test in sequence whether the regression of $X_{p_1+1}$ on $X^{(1)}=
(X_1,\dots,X_{p_1})'$ is 0, whether the regression of $X_{p_1+2}$ on $X^{(1)}$ is 0 in the
regression of $X_{p_1+2}$ on $X^{(1)}$ and $X_{p_1+1}$, ..., and whether the regression of
$X_{p_1+p_2}$ on $X^{(1)}$ is 0 in the regression of $X_{p_1+p_2}$ on $X^{(1)},X_{p_1+1},\dots,X_{p_1+p_2-1}$.
These hypotheses are equivalently that the first, second, ..., and $p_2$th rows
of the matrix in (3) for $i=2$ are 0-vectors.
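Let $A_{ii}^{(k)}$ be the $k\times k$ matrix in the upper left-hand corner of $A_{ii}$, let $A_{ij}^{(k)}$
consist of the upper $k$ rows of $A_{ij}$, and let $A_{ji}^{(k)}$ consist of the first $k$ columns
of $A_{ji}$, $k=1,\dots,p_i$. Then the criterion for testing that the first row of (3) is 0
is

$$X_{i1}=\frac{A_{ii}^{(1)}-\bigl(A_{i1}^{(1)}\ \cdots\ A_{i,i-1}^{(1)}\bigr)\begin{pmatrix}A_{11}&\cdots&A_{1,i-1}\\ \vdots&&\vdots\\ A_{i-1,1}&\cdots&A_{i-1,i-1}\end{pmatrix}^{-1}\begin{pmatrix}A_{1i}^{(1)}\\ \vdots\\ A_{i-1,i}^{(1)}\end{pmatrix}}{A_{ii}^{(1)}}
=\frac{\begin{vmatrix}A_{11}&\cdots&A_{1,i-1}&A_{1i}^{(1)}\\ \vdots&&\vdots&\vdots\\ A_{i-1,1}&\cdots&A_{i-1,i-1}&A_{i-1,i}^{(1)}\\ A_{i1}^{(1)}&\cdots&A_{i,i-1}^{(1)}&A_{ii}^{(1)}\end{vmatrix}}
{\begin{vmatrix}A_{11}&\cdots&A_{1,i-1}\\ \vdots&&\vdots\\ A_{i-1,1}&\cdots&A_{i-1,i-1}\end{vmatrix}\cdot A_{ii}^{(1)}}. \tag{4}$$

For $k>1$, the criterion for testing that the $k$th row of the matrix in (3) is 0 is
[see (8) in Section 8.4]

$$X_{ik}=\frac{\left|A_{ii}^{(k)}-\bigl(A_{i1}^{(k)}\ \cdots\ A_{i,i-1}^{(k)}\bigr)\begin{pmatrix}A_{11}&\cdots&A_{1,i-1}\\ \vdots&&\vdots\\ A_{i-1,1}&\cdots&A_{i-1,i-1}\end{pmatrix}^{-1}\begin{pmatrix}A_{1i}^{(k)}\\ \vdots\\ A_{i-1,i}^{(k)}\end{pmatrix}\right|}
{\left|A_{ii}^{(k-1)}-\bigl(A_{i1}^{(k-1)}\ \cdots\ A_{i,i-1}^{(k-1)}\bigr)\begin{pmatrix}A_{11}&\cdots&A_{1,i-1}\\ \vdots&&\vdots\\ A_{i-1,1}&\cdots&A_{i-1,i-1}\end{pmatrix}^{-1}\begin{pmatrix}A_{1i}^{(k-1)}\\ \vdots\\ A_{i-1,i}^{(k-1)}\end{pmatrix}\right|}\cdot\frac{|A_{ii}^{(k-1)}|}{|A_{ii}^{(k)}|}$$

$$\phantom{X_{ik}}=\frac{\begin{vmatrix}A_{11}&\cdots&A_{1,i-1}&A_{1i}^{(k)}\\ \vdots&&\vdots&\vdots\\ A_{i-1,1}&\cdots&A_{i-1,i-1}&A_{i-1,i}^{(k)}\\ A_{i1}^{(k)}&\cdots&A_{i,i-1}^{(k)}&A_{ii}^{(k)}\end{vmatrix}}
{\begin{vmatrix}A_{11}&\cdots&A_{1,i-1}&A_{1i}^{(k-1)}\\ \vdots&&\vdots&\vdots\\ A_{i-1,1}&\cdots&A_{i-1,i-1}&A_{i-1,i}^{(k-1)}\\ A_{i1}^{(k-1)}&\cdots&A_{i,i-1}^{(k-1)}&A_{ii}^{(k-1)}\end{vmatrix}}\cdot\frac{|A_{ii}^{(k-1)}|}{|A_{ii}^{(k)}|},
\qquad k=2,\dots,p_i,\quad i=2,\dots,q. \tag{5}$$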


Under the null hypothesis the criterion $X_{ij}$ has the beta density
$\beta[x;\tfrac12(n-\bar p_i+1-j),\tfrac12\bar p_i]$. For given $i$, the criteria $X_{i1},\dots,X_{ip_i}$ are independent (Theorem
8.4.1). The sets for different $i$ are independent by the argument in Section
9.6.1.

A step-down procedure consists of a sequence of tests based on
$X_{21},\dots,X_{2p_2},X_{31},\dots,X_{qp_q}$. A particular component test leads to rejection if

$$\frac{1-X_{ij}}{X_{ij}}\cdot\frac{n-\bar p_i+1-j}{\bar p_i}>F_{\bar p_i,\,n-\bar p_i+1-j}(\varepsilon_{ij}). \tag{6}$$

The significance level is $\varepsilon$, where

$$1-\varepsilon=\prod_{i=2}^{q}\prod_{j=1}^{p_i}(1-\varepsilon_{ij}). \tag{7}$$




The sequence of subvectors and the sequence of components within each 
subvector is at the discretion of the investigator. 

The criterion $V_i$ for testing $H_i$ is $V_i=\prod_{j=1}^{p_i}X_{ij}$, and the criterion for the null
hypothesis $H$ is

$$V=\prod_{i=2}^{q}V_i=\prod_{i=2}^{q}\prod_{j=1}^{p_i}X_{ij}. \tag{8}$$


These are the random variables described in Theorem 9.3.3. 


9.7. AN EXAMPLE


We take the following example from an industrial time study [Abruzzi
(1950)]. The purpose of the study was to investigate the length of time taken
by various operators in a garment factory to do several elements of a pressing
operation. The entire pressing operation was divided into the following six
elements:

1. Pick up and position garment.
2. Press and repress short dart.
3. Reposition garment on ironing board.
4. Press three-quarters of length of long dart.
5. Press balance of long dart.
6. Hang garment on rack.


In this case $x_\alpha$ is the vector of measurements on individual $\alpha$. The component
$x_{i\alpha}$ is the time taken to do the $i$th element of the operation. $N$ is 76.
The data (in seconds) are summarized in the sample mean vector and
covariance matrix:

$$\bar x=\begin{pmatrix}9.47\\ 25.56\\ 13.25\\ 31.44\\ 27.29\\ 8.80\end{pmatrix}, \tag{1}$$

$$S=\begin{pmatrix}
2.57&0.85&1.56&1.79&1.33&0.42\\
0.85&37.00&3.34&13.47&7.59&0.52\\
1.56&3.34&8.44&5.77&2.00&0.50\\
1.79&13.47&5.77&34.01&10.50&1.77\\
1.33&7.59&2.00&10.50&23.01&3.43\\
0.42&0.52&0.50&1.77&3.43&4.59
\end{pmatrix}. \tag{2}$$


The sample standard deviations are (1.604, 6.041, 2.903, 5.832, 4.798, 2.141).
The sample correlation matrix is

$$R=\begin{pmatrix}
1.000&0.088&0.334&0.191&0.173&0.123\\
0.088&1.000&0.186&0.384&0.262&0.040\\
0.334&0.186&1.000&0.343&0.144&0.080\\
0.191&0.384&0.343&1.000&0.375&0.142\\
0.173&0.262&0.144&0.375&1.000&0.334\\
0.123&0.040&0.080&0.142&0.334&1.000
\end{pmatrix}. \tag{3}$$

The investigators are interested in testing the hypothesis that the six
variates are mutually independent. It often happens in time studies that a
new operation is proposed in which the elements are combined in a different
way; the new operation may use some of the elements several times and some
elements may be omitted. If the times for the different elements in the
operation for which data are available are independent, it may reasonably be
assumed that they will be independent in a new operation. Then the
distribution of time for the new operation can be estimated by using the
means and variances of the individual items.

In this problem the criterion $V$ is $V=|R|=0.472$. Since the sample size is
large we can use asymptotic theory: $k=\tfrac{433}{6}$, $f=15$, and $-k\log V=54.1$.
Since the significance point for the $\chi^2$-distribution with 15 degrees of
freedom is 30.6 at the 0.01 significance level, we find the result significant.
We reject the hypothesis of independence; we cannot consider the times of
the elements independent.
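
The calculation in this example can be retraced from the printed correlation matrix. The sketch below is illustrative only: it uses the three-decimal correlations as printed, so the determinant it produces is close to, but need not coincide with, the reported value 0.472, and the statistic differs correspondingly from 54.1. The helper `det` is a hypothetical utility; $k=N-(2p+11)/6$ and $f=\tfrac12p(p-1)$ are the forms for scalar sets ($p_i=1$).

```python
from math import log


def det(m):
    """Determinant by cofactor expansion along the first row (adequate for p = 6)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))


# Sample correlation matrix as printed (rounded to three decimals).
R = [
    [1.000, 0.088, 0.334, 0.191, 0.173, 0.123],
    [0.088, 1.000, 0.186, 0.384, 0.262, 0.040],
    [0.334, 0.186, 1.000, 0.343, 0.144, 0.080],
    [0.191, 0.384, 0.343, 1.000, 0.375, 0.142],
    [0.173, 0.262, 0.144, 0.375, 1.000, 0.334],
    [0.123, 0.040, 0.080, 0.142, 0.334, 1.000],
]
N, p = 76, 6

V = det(R)                      # with p_i = 1, V = |R|
f = p * (p - 1) // 2            # degrees of freedom, 15
k = N - (2 * p + 11) / 6        # 433/6
statistic = -k * log(V)
reject = statistic > 30.6       # 0.01 significance point of chi-square with 15 d.f.
```

Either way the statistic lies far above the 0.01 point, so the conclusion of the example does not hinge on the rounding.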


9.8. THE CASE OF TWO SETS OF VARIATES 


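In the case of two sets of variates ($q=2$), the random vector $X$, the
observation vector $x_\alpha$, the mean vector $\mu$, and the covariance matrix $\Sigma$ are
partitioned as follows:

$$X=\begin{pmatrix}X^{(1)}\\ X^{(2)}\end{pmatrix},\qquad x_\alpha=\begin{pmatrix}x_\alpha^{(1)}\\ x_\alpha^{(2)}\end{pmatrix},\qquad \mu=\begin{pmatrix}\mu^{(1)}\\ \mu^{(2)}\end{pmatrix},\qquad \Sigma=\begin{pmatrix}\Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22}\end{pmatrix}. \tag{1}$$

The null hypothesis of independence specifies that $\Sigma_{12}=0$, that is, that $\Sigma$ is
of the form

$$\Sigma_0=\begin{pmatrix}\Sigma_{11}&0\\ 0&\Sigma_{22}\end{pmatrix}. \tag{2}$$

The test criterion is

$$V=\frac{|A|}{|A_{11}|\cdot|A_{22}|}. \tag{3}$$

It was shown in Section 9.3 that when the null hypothesis is true, this
criterion is distributed as $U_{p_1,p_2,N-1-p_2}$, the criterion for testing a hypothesis
about regression coefficients (Chapter 8). We now wish to study further the
relationship between testing the hypothesis of independence of two sets and
testing the hypothesis that regression of one set on the other is zero.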

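The conditional distribution of $X_\alpha^{(1)}$ given $X_\alpha^{(2)}=x_\alpha^{(2)}$ is $N[\mu^{(1)}+\beta(x_\alpha^{(2)}-\mu^{(2)}),\Sigma_{11\cdot2}]
=N[\beta(x_\alpha^{(2)}-\bar x^{(2)})+\nu,\Sigma_{11\cdot2}]$, where $\beta=\Sigma_{12}\Sigma_{22}^{-1}$, $\Sigma_{11\cdot2}=
\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, and $\nu=\mu^{(1)}+\beta(\bar x^{(2)}-\mu^{(2)})$. Let $X_\alpha^*=X_\alpha^{(1)}$, $z_\alpha^{*\prime}=[(x_\alpha^{(2)}-\bar x^{(2)})'\ \ 1]$,
$\beta^*=(\beta\ \ \nu)$, and $\Sigma^*=\Sigma_{11\cdot2}$. Then the conditional distribution of $X_\alpha^*$
is $N(\beta^*z_\alpha^*,\Sigma^*)$. This is exactly the distribution studied in Chapter 8.

The null hypothesis that $\Sigma_{12}=0$ is equivalent to the null hypothesis $\beta=0$.
Considering $x_\alpha^{(2)}$ fixed, we know from Chapter 8 that the criterion (based on
the likelihood ratio criterion) for testing this hypothesis is

$$U=\frac{\bigl|\sum_\alpha(x_\alpha^*-\hat\beta_\Omega^*z_\alpha^*)(x_\alpha^*-\hat\beta_\Omega^*z_\alpha^*)'\bigr|}{\bigl|\sum_\alpha(x_\alpha^*-\hat\beta_\omega^*z_\alpha^{*(2)})(x_\alpha^*-\hat\beta_\omega^*z_\alpha^{*(2)})'\bigr|}, \tag{4}$$

where $z_\alpha^{*(2)}=1$ is the last component of $z_\alpha^*$ and

$$\hat\beta_\omega^*=\hat\nu_\omega=\bar x^{(1)},$$
$$\hat\beta_\Omega^*=(\hat\beta_\Omega\ \ \hat\nu_\Omega)=\sum_\alpha x_\alpha^*z_\alpha^{*\prime}\Bigl(\sum_\alpha z_\alpha^*z_\alpha^{*\prime}\Bigr)^{-1}
=(A_{12}\ \ N\bar x^{(1)})\begin{pmatrix}A_{22}&0\\ 0&N\end{pmatrix}^{-1}
=(A_{12}A_{22}^{-1}\ \ \bar x^{(1)}). \tag{5}$$

The matrix in the denominator of $U$ is

$$\sum_{\alpha=1}^{N}\bigl(x_\alpha^{(1)}-\bar x^{(1)}\bigr)\bigl(x_\alpha^{(1)}-\bar x^{(1)}\bigr)'=A_{11}. \tag{6}$$

The matrix in the numerator is

$$\sum_{\alpha=1}^{N}\bigl[x_\alpha^{(1)}-\bar x^{(1)}-A_{12}A_{22}^{-1}(x_\alpha^{(2)}-\bar x^{(2)})\bigr]\bigl[x_\alpha^{(1)}-\bar x^{(1)}-A_{12}A_{22}^{-1}(x_\alpha^{(2)}-\bar x^{(2)})\bigr]'
=A_{11}-A_{12}A_{22}^{-1}A_{21}. \tag{7}$$

Therefore,

$$U=\frac{|A_{11}-A_{12}A_{22}^{-1}A_{21}|}{|A_{11}|}=\frac{|A|}{|A_{11}|\cdot|A_{22}|}, \tag{8}$$

which is exactly $V$.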
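
The identity behind this conclusion, $|A_{11}-A_{12}A_{22}^{-1}A_{21}|/|A_{11}|=|A|/(|A_{11}|\,|A_{22}|)$, can be verified in exact arithmetic. The sketch below assumes a hypothetical $3\times3$ matrix with $p_1=2$ and $p_2=1$, so that $A_{22}$ is a scalar and its inverse is trivial; the names are illustrative only.

```python
from fractions import Fraction


def det(m):
    """Determinant by cofactor expansion along the first row (fine for small p)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))


A = [[4, 2, 1], [2, 9, 3], [1, 3, 16]]
p1 = 2                                   # first set: 2 variates; second set: 1 variate
a22 = A[2][2]                            # A_22 is the scalar 16
A11 = [row[:p1] for row in A[:p1]]
a12 = [A[i][2] for i in range(p1)]       # the column A_12

# Residual matrix A_11 - A_12 A_22^{-1} A_21 from the regression of set 1 on set 2.
residual = [[Fraction(A11[i][j]) - Fraction(a12[i] * a12[j], a22) for j in range(p1)]
            for i in range(p1)]

U = det(residual) / det(A11)             # regression (Chapter 8) form
V = Fraction(det(A), det(A11) * a22)     # independence form |A| / (|A_11| |A_22|)
```

Both forms give $479/512$ here.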

Now let us see why it is that when the null hypothesis is true the
distribution of $U=V$ does not depend on whether the $X_\alpha^{(2)}$ are held fixed. It
was shown in Chapter 8 that when the null hypothesis is true the distribution
of $U$ depends only on $p$, $q_1$, and $N-q_2$, not on $z_\alpha$. Thus the conditional
distribution of $V$ given $X_\alpha^{(2)}=x_\alpha^{(2)}$ does not depend on $x_\alpha^{(2)}$; the joint distribution
of $V$ and $X_\alpha^{(2)}$ is the product of the distribution of $V$ and the distribution
of $X_\alpha^{(2)}$, and the marginal distribution of $V$ is this conditional distribution.
This shows that the distribution of $V$ (under the null hypothesis) does not
depend on whether the $X_\alpha^{(2)}$ are fixed or have any distribution (normal or
not).

We can extend this result to show that if $q>2$, the distribution of $V$
under the null hypothesis of independence does not depend on the distribution
of one set of variates, say $X_\alpha^{(1)}$. We have $V=V_2\cdots V_q$, where $V_i$ is
defined in (1) of Section 9.3. When the null hypothesis is true, $V_q$ is
distributed independently of $X_\alpha^{(1)},\dots,X_\alpha^{(q-1)}$ by the previous result. In turn
we argue that $V_j$ is distributed independently of $X_\alpha^{(1)},\dots,X_\alpha^{(j-1)}$. Thus
$V_2\cdots V_q$ is distributed independently of $X_\alpha^{(1)}$.


Theorem 9.8.1. Under the null hypothesis of independence, the distribution 
of V is that given earlier in this chapter if q—1 sets are jointly normally 
distributed, even though one set is not normally distributed. 


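In the case of two sets of variates, we may be interested in a measure of
association between the two sets which is a generalization of the correlation
coefficient. The square of the correlation between two scalars $X_1$ and $X_2$
can be considered as the ratio of the variance of the regression of $X_1$ on $X_2$
to the variance of $X_1$; this is $\mathcal{V}(\beta X_2)/\mathcal{V}(X_1)=\beta^2\sigma_{22}/\sigma_{11}=(\sigma_{12}^2/\sigma_{22})/\sigma_{11}
=\rho_{12}^2$. A corresponding measure for vectors $X^{(1)}$ and $X^{(2)}$ is the ratio of the
generalized variance of the regression of $X^{(1)}$ on $X^{(2)}$ to the generalized
variance of $X^{(1)}$, namely,

$$\frac{\bigl|\mathcal{E}\beta X^{(2)}(\beta X^{(2)})'\bigr|}{|\Sigma_{11}|}=\frac{|\beta\Sigma_{22}\beta'|}{|\Sigma_{11}|}=\frac{|\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|}{|\Sigma_{11}|}
=(-1)^{p_1}\frac{\begin{vmatrix}0&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22}\end{vmatrix}}{|\Sigma_{11}|\cdot|\Sigma_{22}|}. \tag{9}$$

If $p_1=p_2$, the measure is

$$\frac{|\Sigma_{12}|^2}{|\Sigma_{11}|\cdot|\Sigma_{22}|}. \tag{10}$$

In a sense this measure shows how well $X^{(1)}$ can be predicted from $X^{(2)}$.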

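In the case of two scalar variables $X_1$ and $X_2$ the coefficient of alienation
is $\sigma_{1\cdot2}^2/\sigma_1^2$, where $\sigma_{1\cdot2}^2=\mathcal{E}(X_1-\beta X_2)^2$ is the variance of $X_1$ about its
regression on $X_2$ when $\mathcal{E}X_1=\mathcal{E}X_2=0$ and $\mathcal{E}(X_1\mid X_2)=\beta X_2$. In the case of
two vectors $X^{(1)}$ and $X^{(2)}$, the regression matrix is $\beta=\Sigma_{12}\Sigma_{22}^{-1}$, and the
generalized variance of $X^{(1)}$ about its regression on $X^{(2)}$ is

$$\bigl|\mathcal{E}[(X^{(1)}-\beta X^{(2)})(X^{(1)}-\beta X^{(2)})']\bigr|=|\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|=\frac{|\Sigma|}{|\Sigma_{22}|}. \tag{11}$$

Since the generalized variance of $X^{(1)}$ is $|\mathcal{E}X^{(1)}X^{(1)\prime}|=|\Sigma_{11}|$, the vector
coefficient of alienation is

$$\frac{|\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|}{|\Sigma_{11}|}=\frac{|\Sigma|}{|\Sigma_{11}|\cdot|\Sigma_{22}|}. \tag{12}$$

The sample equivalent of (12) is simply $V$.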

A measure of association is 1 minus the coefficient of alienation. Either of
these two measures of association can be modified to take account of the
number of components. In the first case, one can take the $p_1$th root of (9); in
the second case, one can subtract the $p_1$th root of the coefficient of
alienation from 1. Another measure of association is

$$\frac{\operatorname{tr}\mathcal{E}[\beta X^{(2)}(\beta X^{(2)})'](\mathcal{E}X^{(1)}X^{(1)\prime})^{-1}}{p_1}=\frac{\operatorname{tr}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}}{p_1}. \tag{13}$$

This measure of association ranges between 0 and 1. If $X^{(1)}$ can be predicted
exactly from $X^{(2)}$ for $p_1\le p_2$ (i.e., $\Sigma_{11\cdot2}=0$), then this measure is 1. If no
linear combination of $X^{(1)}$ can be predicted exactly, this measure is 0.


9.9. ADMISSIBILITY OF THE LIKELIHOOD RATIO TEST


The admissibility of the likelihood ratio test in the case of the 0-1 loss 
function can be proved by showing that it is the Bayes procedure with respect 
to an appropriate a priori distribution of the parameters. (See Section 5.6.) 


Theorem 9.9.1. The likelihood ratio test of the hypothesis that $\Sigma$ is of the
form (6) of Section 9.2 is Bayes and admissible if $N>p+1$.


Proof. We shall show that the likelihood ratio test is equivalent to rejection
of the hypothesis when

$$\frac{\int f(x\mid\theta)\,\Pi_1(d\theta)}{\int f(x\mid\theta)\,\Pi_0(d\theta)}\ge c, \tag{1}$$

where $x$ represents the sample, $\theta$ represents the parameters ($\mu$ and $\Sigma$),
$f(x\mid\theta)$ is the density, and $\Pi_1$ and $\Pi_0$ are proportional to probability measures
of $\theta$ under the alternative and null hypotheses, respectively. Specifically,
the left-hand side is to be proportional to the square root of $\prod_{i=1}^{q}|A_{ii}|/|A|$.

To define $\Pi_1$, let

$$\mu=(I+VV')^{-1}VY,\qquad \Sigma=(I+VV')^{-1}, \tag{2}$$

where the $p$-component random vector $V$ has the density proportional to
$(1+v'v)^{-\frac12n}$, $n=N-1$, and the conditional distribution of $Y$ given $V=v$ is
$N[0,(1+v'v)/N]$. Note that the integral of $(1+v'v)^{-\frac12n}$ is finite if $n>p$
(Problem 5.15). The numerator of (1) is then

$$\text{const}\int\!\cdots\!\int|I+vv'|^{\frac12N}\exp\Bigl\{-\tfrac12\sum_{\alpha=1}^{N}\bigl[x_\alpha-(I+vv')^{-1}vy\bigr]'(I+vv')\bigl[x_\alpha-(I+vv')^{-1}vy\bigr]\Bigr\}$$
$$\qquad\cdot(1+v'v)^{-\frac12n}(1+v'v)^{-\frac12}\exp\Bigl[-\tfrac12\frac{Ny^2}{1+v'v}\Bigr]\,dv\,dy. \tag{3}$$

The exponent in the integrand of (3) is $-\tfrac12$ times

$$\sum_{\alpha=1}^{N}x_\alpha'(I+vv')x_\alpha-2yv'\sum_{\alpha=1}^{N}x_\alpha+Ny^2v'(I+vv')^{-1}v+\frac{Ny^2}{1+v'v}$$
$$=\sum_{\alpha=1}^{N}x_\alpha'x_\alpha+v'\sum_{\alpha=1}^{N}x_\alpha x_\alpha'v-2yv'N\bar x+Ny^2
=\operatorname{tr}A+v'Av+N\bar x'\bar x+N(y-\bar x'v)^2, \tag{4}$$

where $A=\sum_{\alpha=1}^{N}x_\alpha x_\alpha'-N\bar x\bar x'$. We have used $v'(I+vv')^{-1}v+(1+v'v)^{-1}=1$
[from $(I+vv')^{-1}=I-(1+v'v)^{-1}vv'$]. Using $|I+vv'|=1+v'v$ (Corollary
A.3.1), we write (3) as

$$\text{const}\;e^{-\frac12(\operatorname{tr}A+N\bar x'\bar x)}\int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty}e^{-\frac12v'Av}\,dv=\text{const}\,|A|^{-\frac12}e^{-\frac12(\operatorname{tr}A+N\bar x'\bar x)}. \tag{5}$$

To define $\Pi_0$, let $\Sigma$ have the form of (6) of Section 9.2. Let

$$[\mu^{(i)},\Sigma_{ii}]=\bigl[(I+V^{(i)}V^{(i)\prime})^{-1}V^{(i)}Y_i,\ (I+V^{(i)}V^{(i)\prime})^{-1}\bigr],\qquad i=1,\dots,q, \tag{6}$$

where the $p_i$-component random vector $V^{(i)}$ has density proportional to
$(1+v^{(i)\prime}v^{(i)})^{-\frac12n}$, and the conditional distribution of $Y_i$ given $V^{(i)}=v^{(i)}$ is
$N[0,(1+v^{(i)\prime}v^{(i)})/N]$, and let $(V^{(1)},Y_1),\dots,(V^{(q)},Y_q)$ be mutually independent.
Then the denominator of (1) is

$$\prod_{i=1}^{q}\text{const}\,|A_{ii}|^{-\frac12}\exp\bigl[-\tfrac12(\operatorname{tr}A_{ii}+N\bar x^{(i)\prime}\bar x^{(i)})\bigr]
=\text{const}\Bigl(\prod_{i=1}^{q}|A_{ii}|^{-\frac12}\Bigr)\exp\bigl[-\tfrac12(\operatorname{tr}A+N\bar x'\bar x)\bigr]. \tag{7}$$

The left-hand side of (1) is then proportional to the square root of
$\prod_{i=1}^{q}|A_{ii}|/|A|$. ∎


This proof has been adapted from that of Kiefer and Schwartz (1965). 


9.10. MONOTONICITY OF POWER FUNCTIONS OF TESTS OF 
INDEPENDENCE OF SETS 


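Let $Z_\alpha=(Z_\alpha^{(1)\prime},Z_\alpha^{(2)\prime})'$, $\alpha=1,\dots,n$, be distributed according to

$$N\left[\begin{pmatrix}0\\ 0\end{pmatrix},\begin{pmatrix}\Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22}\end{pmatrix}\right]. \tag{1}$$

We want to test $H$: $\Sigma_{12}=0$. We suppose $p_1\le p_2$ without loss of generality.
Let $\rho_1,\dots,\rho_{p_1}$ ($\rho_1\ge\cdots\ge\rho_{p_1}$) be the (population) canonical correlation
coefficients. (The $\rho_i^2$'s are the characteristic roots of $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, Chapter
12.) Let $R=\operatorname{diag}(\rho_1,\dots,\rho_{p_1})$ and $\Delta=[R,0]$ ($p_1\times p_2$).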


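Lemma 9.10.1. There exist matrices $B_1$ ($p_1\times p_1$), $B_2$ ($p_2\times p_2$) such that

$$B_1\Sigma_{11}B_1'=I_{p_1},\qquad B_2\Sigma_{22}B_2'=I_{p_2},\qquad B_1\Sigma_{12}B_2'=\Delta. \tag{2}$$


Proof. Let $m=p_2$, $B=B_1$, $F'=\Sigma_{22}^{\frac12}B_2'$, $S=\Sigma_{12}\Sigma_{22}^{-\frac12}$ in Lemma 8.10.13.
Then $F'F=B_2\Sigma_{22}B_2'=I_{p_2}$, $B_1\Sigma_{12}B_2'=B_1SF'=\Delta$. ∎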


(This lemma is also contained in Section 12.2.) 

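Let $x_\alpha=B_1Z_\alpha^{(1)}$, $y_\alpha=B_2Z_\alpha^{(2)}$, $\alpha=1,\dots,n$, and $X=(x_1,\dots,x_n)$, $Y=
(y_1,\dots,y_n)$. Then $(x_\alpha',y_\alpha')'$, $\alpha=1,\dots,n$, are independently distributed according
to

$$N\left[\begin{pmatrix}0\\ 0\end{pmatrix},\begin{pmatrix}I_{p_1}&\Delta\\ \Delta'&I_{p_2}\end{pmatrix}\right]. \tag{3}$$

The hypothesis $H$: $\Sigma_{12}=0$ is equivalent to $H$: $\Delta=0$ (i.e., all the canonical
correlation coefficients $\rho_1,\dots,\rho_{p_1}$ are zero). Now given $Y$, the vectors $x_\alpha$,
$\alpha=1,\dots,n$, are conditionally independently distributed according to
$N(\Delta y_\alpha,I-\Delta\Delta')=N(\Delta y_\alpha,I-R^2)$. Then $x_\alpha^*=(I-R^2)^{-\frac12}x_\alpha$ is distributed
according to $N(My_\alpha,I)$ where

$$M=(D,0),\qquad D=\operatorname{diag}(\delta_1,\dots,\delta_{p_1}),\qquad \delta_i=\rho_i/(1-\rho_i^2)^{\frac12},\quad i=1,\dots,p_1. \tag{4}$$

Note that $\delta_i^2$ is a characteristic root of $\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11\cdot2}^{-1}$, where $\Sigma_{11\cdot2}=\Sigma_{11}
-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$.

Invariant tests depend only on the (sample) canonical correlation coefficients
$r_i=\sqrt{c_i}$, where

$$c_i=\lambda_i\bigl[(X^*X^{*\prime})^{-1}(X^*Y')(YY')^{-1}(YX^{*\prime})\bigr]. \tag{5}$$

Let

$$S_h=X^*Y'(YY')^{-1}YX^{*\prime},\qquad S_e=X^*X^{*\prime}-S_h=X^*\bigl[I-Y'(YY')^{-1}Y\bigr]X^{*\prime}. \tag{6}$$

Then

$$\lambda_i(S_hS_e^{-1})=\frac{c_i}{1-c_i}. \tag{7}$$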

Now given $Y$, the problem reduces to the MANOVA problem and we can
apply Theorem 8.10.6 as follows. There is an orthogonal transformation
(Section 8.3.3) that carries $X^*$ to $(U,V)$ such that $S_h=UU'$, $S_e=VV'$,
$U=(u_1,\dots,u_{p_2})$, $V$ is $p_1\times(n-p_2)$, $u_i$ has the distribution $N(\delta_i\varepsilon_i,I)$,
$i=1,\dots,p_1$ ($\varepsilon_i$ being the $i$th column of $I$), and $N(0,I)$, $i=p_1+1,\dots,p_2$,
and the columns of $V$ are independently distributed according to $N(0,I)$.
Then $c_1,\dots,c_{p_1}$ are the characteristic roots of $UU'(VV')^{-1}$, and their distribution
depends on the characteristic roots of $MYY'M'$, say, $\tau_1^2,\dots,\tau_{p_1}^2$. Now
from Theorem 8.10.6, we obtain the following lemma.


Lemma 9.10.2. If the acceptance region of an invariant test is convex in
each column of $U$, given $V$ and the other columns of $U$, then the conditional
power given $Y$ increases in each characteristic root $\tau_i^2$ of $MYY'M'$.


Lemma 9.10.3. If $A\ge B$, then $\lambda_i(A)\ge\lambda_i(B)$.


Proof. By the minimax property of the characteristic roots [see, e.g.,
Courant and Hilbert (1953)],

$$\lambda_i(A)=\min_{S_i}\max_{x\in S_i}\frac{x'Ax}{x'x}\ge\min_{S_i}\max_{x\in S_i}\frac{x'Bx}{x'x}=\lambda_i(B), \tag{8}$$

where $S_i$ ranges over $(p-i+1)$-dimensional subspaces. ∎


Now Lemma 9.10.3 applied to $MYY'M'$ shows that for every $j$, $\tau_j^2$ is an
increasing function of $\delta_i=\rho_i/(1-\rho_i^2)^{\frac12}$ and hence of $\rho_i$. Since the marginal
distribution of $Y$ does not depend on the $\rho_i$'s, by taking the unconditional
power we obtain the following theorem.


Theorem 9.10.1. An invariant test for which the acceptance region is convex
in each column of $U$ for each set of fixed $V$ and other columns of $U$ has a power
function that is monotonically increasing in each $\rho_i$.


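9.11. ELLIPTICALLY CONTOURED DISTRIBUTIONS


9.11.1. Observations Elliptically Contoured


Let $x_1,\dots,x_N$ be $N$ observations on a random vector $X$ with density

$$|\Lambda|^{-\frac12}g[(x-\nu)'\Lambda^{-1}(x-\nu)], \tag{1}$$

where $\mathcal{E}R^4<\infty$ and $R^2=(X-\nu)'\Lambda^{-1}(X-\nu)$. Then $\mathcal{E}X=\nu$ and $\mathcal{E}(X-\nu)(X-\nu)'
=\Sigma=(\mathcal{E}R^2/p)\Lambda$. Let

$$\bar x=\frac1N\sum_{\alpha=1}^{N}x_\alpha,\qquad S=\frac1{N-1}\sum_{\alpha=1}^{N}(x_\alpha-\bar x)(x_\alpha-\bar x)'. \tag{2}$$

Then

$$\sqrt N\operatorname{vec}(S-\Sigma)\xrightarrow{d}N\bigl[0,\ (\kappa+1)(I_{p^2}+K_{pp})(\Sigma\otimes\Sigma)+\kappa\operatorname{vec}\Sigma\,(\operatorname{vec}\Sigma)'\bigr], \tag{3}$$

where $1+\kappa=p\,\mathcal{E}R^4/[(p+2)(\mathcal{E}R^2)^2]$.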

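The likelihood ratio criterion for testing the null hypothesis $\Sigma_{ij}=0$, $i\ne j$,
is the $N/2$th power of $U=\prod_{i=2}^{q}V_i$, where $V_i$ is the $U$-criterion for testing the
null hypothesis $\Sigma_{i1}=0,\dots,\Sigma_{i,i-1}=0$ and is given by (1) and (6) of Section
9.3. The form of $V_i$ is that of the likelihood ratio criterion $U$ of Chapter 8
with $X$ replaced by $X^{(i)}$, $\beta$ by $\beta_i$ given by (5) of Section 9.3, $Z$ by

$$\bar X^{(i-1)}=\begin{pmatrix}X^{(1)}\\ \vdots\\ X^{(i-1)}\end{pmatrix}, \tag{4}$$

and $\Sigma$ by $\Sigma_{ii}$ under the null hypothesis $\beta_i=0$. The subvector $\bar X^{(i-1)}$ is
uncorrelated with $X^{(i)}$, but not independent of $X^{(i)}$ unless $(\bar X^{(i-1)\prime},X^{(i)\prime})'$ is
normal. Let

$$\bar A^{(i-1)}=\begin{pmatrix}A_{11}&\cdots&A_{1,i-1}\\ \vdots&&\vdots\\ A_{i-1,1}&\cdots&A_{i-1,i-1}\end{pmatrix}, \tag{5}$$

$$A^{(i,i-1)}=(A_{i1},\dots,A_{i,i-1})=A^{(i-1,i)\prime}, \tag{6}$$

with similar definitions of $\bar\Sigma^{(i-1)}$, $\Sigma^{(i,i-1)}$, $\bar S^{(i-1)}$, and $S^{(i,i-1)}$. We write
$V_i=|G_i|/|G_i+H_i|$, where

$$H_i=A^{(i,i-1)}\bigl(\bar A^{(i-1)}\bigr)^{-1}A^{(i-1,i)}=(N-1)\,S^{(i,i-1)}\bigl(\bar S^{(i-1)}\bigr)^{-1}S^{(i-1,i)}, \tag{7}$$

$$G_i=A_{ii}-H_i=(N-1)S_{ii}-H_i. \tag{8}$$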


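Theorem 9.11.1. When $X$ has the density (1) and the null hypothesis is true,
the limiting distribution of $H_i$ is $W[(1+\kappa)\Sigma_{ii},\bar p_i]$, where $\bar p_i=p_1+\cdots+p_{i-1}$
and $p_i$ is the number of components of $X^{(i)}$.

Proof. Since $\Sigma^{(i,i-1)}=0$, we have $\mathcal{E}S^{(i,i-1)}=0$ and

$$\mathcal{E}s_{jk}s_{lm}=\Bigl(\frac{\kappa}{N}+\frac{1}{N-1}\Bigr)\sigma_{jl}\sigma_{km} \tag{9}$$

if $j,l\le\bar p_i$ and $k,m>\bar p_i$ or if $j,l>\bar p_i$ and $k,m\le\bar p_i$, and $\mathcal{E}s_{jk}s_{lm}=0$ otherwise
(Theorem 3.6.1). We can write

$$\mathcal{E}\operatorname{vec}S^{(i,i-1)}\bigl(\operatorname{vec}S^{(i,i-1)}\bigr)'=\Bigl(\frac{\kappa}{N}+\frac{1}{N-1}\Bigr)\bigl(\bar\Sigma^{(i-1)}\otimes\Sigma_{ii}\bigr). \tag{10}$$

Since $\bar S^{(i-1)}\xrightarrow{p}\bar\Sigma^{(i-1)}$ and $\sqrt N\operatorname{vec}S^{(i,i-1)}$ has a limiting normal distribution,
Theorem 9.11.1 follows by (2) of Section 8.4. ∎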


Theorem 9.11.2. Under the conditions of Theorem 9.11.1 when the null hypothesis is true

(11)  $-N \log V_i \xrightarrow{d} (1+\kappa)\,\chi^2_{p_i \bar{p}_{i-1}}$.

Proof. We can write $V_i = |I + N^{-1}(N^{-1}G_i)^{-1}H_i|^{-1}$ and use $N \log|I + N^{-1}C| = \mathrm{tr}\,C + O_p(N^{-1})$ and


-I B,-\ B, 
(12) tr( WG] H=N ¥ » g's, g58"S,, 


-1 
=Novee S"*) | (46) @So'lvec Ss’), om 
Because $X^{(i)}$ is uncorrelated with $\bar{X}^{(i-1)}$ when the null hypothesis $H_i: \bar{\Sigma}^{(i,i-1)} = 0$ is true, $V_i$ is asymptotically independent of $V_2, \ldots, V_{i-1}$. When the null hypotheses $H_2, \ldots, H_q$ are true, $V_i$ is asymptotically independent of $V_2, \ldots, V_{i-1}$. It follows from Theorem 9.11.2 that

(13)  $-N \log V = -N \sum_{i=2}^{q} \log V_i \xrightarrow{d} (1+\kappa)\,\chi^2_f$,

where $f = \sum_{i<j} p_i p_j = \frac12\left[p(p+1) - \sum_{i=1}^{q} p_i(p_i+1)\right]$. The likelihood ratio test of Section 9.2 can be carried out on an asymptotic basis.


Let $A_0 = \mathrm{diag}(A_{11}, \ldots, A_{qq})$. Then

(14)  $\frac12\,\mathrm{tr}\,(A_0^{-1}A - I)^2 = \frac12 \sum_{i \ne j} \mathrm{tr}\,A_{ii}^{-1}A_{ij}A_{jj}^{-1}A_{ji}$

has the $\chi^2_f$-distribution when $\Sigma = \mathrm{diag}(\Sigma_{11}, \ldots, \Sigma_{qq})$. The step-down procedure of Section 9.6.1 is also justified on an asymptotic basis.


9.11.2. Elliptically Contoured Matrix Distributions


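Let $Y$ ($p \times N$) have the density $g(\mathrm{tr}\,YY')$. The matrix $Y$ is vector-spherical; that is, $\mathrm{vec}\,Y$ is spherical and has the stochastic representation $\mathrm{vec}\,Y = R\,\mathrm{vec}\,U_{p \times N}$, where $R^2 = (\mathrm{vec}\,Y)'\,\mathrm{vec}\,Y = \mathrm{tr}\,YY'$ and $\mathrm{vec}\,U_{p \times N}$ has the uniform distribution on the unit sphere $(\mathrm{vec}\,U_{p \times N})'\,\mathrm{vec}\,U_{p \times N} = 1$. (We use the notation $U_{p \times N}$ to distinguish it from $U$ uniform on the space $UU' = I_p$.)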

Let 


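(15)  $X = \nu\varepsilon_N' + CY$,

where $\Lambda = CC'$ and $C$ is lower triangular. Then $X$ has the density

(16)  $|\Lambda|^{-N/2}\, g[\mathrm{tr}\,C^{-1}(X - \nu\varepsilon_N')(X' - \varepsilon_N\nu')(C')^{-1}] = |\Lambda|^{-N/2}\, g[\mathrm{tr}\,(X' - \varepsilon_N\nu')\Lambda^{-1}(X - \nu\varepsilon_N')]$.

Consider the null hypothesis $\Sigma_{ij} = 0$, $i \ne j$, or alternatively $\Lambda_{ij} = 0$, $i \ne j$, or alternatively $R_{ij} = 0$, $i \ne j$. Then $C = \mathrm{diag}(C_{11}, \ldots, C_{qq})$.

Let $M = I_N - (1/N)\varepsilon_N\varepsilon_N'$; since $M^2 = M$, $M$ is an idempotent matrix with $N-1$ characteristic roots 1 and one root 0. Then $A = XMX'$ and $A_{ii} = X^{(i)}MX^{(i)\prime}$. The likelihood function is

(17)  $|\Lambda|^{-N/2}\, g\{\mathrm{tr}\,\Lambda^{-1}[A + N(\bar{x}-\nu)(\bar{x}-\nu)']\}$.

The matrix $A$ and the vector $\bar{x}$ are sufficient statistics, and the likelihood ratio criterion for the hypothesis $H$ is $(|A| / \prod_{i=1}^{q}|A_{ii}|)^{N/2}$, the same as for normality. See Anderson and Fang (1990b).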


Theorem 9.11.3. Let $f(X)$ be a vector-valued function of $X$ ($p \times N$) such that

(18)  $f(X + \nu\varepsilon_N') = f(X)$

for all $\nu$ and

(19)  $f(KX) = f(X)$

for all $K = \mathrm{diag}(K_{11}, \ldots, K_{qq})$. Then the distribution of $f(X)$, where $X$ has the arbitrary density (16), is the same as the distribution of $f(X)$, where $X$ has the normal density (16).

Proof. The proof is similar to the proof of Theorem 4.5.4. ∎

It follows from Theorem 9.11.3 that $V$ has the same distribution under the null hypothesis $H$ when $X$ has the density (16) and for $X$ normally distributed since $V$ is invariant under the transformation $X \to KX$. Similarly, $V_i$ and the criterion (14) are invariant, and hence have the distribution under normality.
PROBLEMS 

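9.1. (Sec. 9.3) Prove

$\mathcal{E}V^h = \dfrac{\prod_{i=1}^{p}\Gamma[\frac12(n+1-i)+h]\ \prod_{i=1}^{q}\{\prod_{j=1}^{p_i}\Gamma[\frac12(n+1-j)]\}}{\prod_{i=1}^{p}\Gamma[\frac12(n+1-i)]\ \prod_{i=1}^{q}\{\prod_{j=1}^{p_i}\Gamma[\frac12(n+1-j)+h]\}}$

by integration of $V^h w(A|\Sigma_0, n)$. [Hint: Show

$\mathcal{E}V^h = \dfrac{K(\Sigma_0, n)}{K(\Sigma_0, n+2h)} \int\cdots\int \prod_{i=1}^{q}|A_{ii}|^{-h}\, w(A|\Sigma_0, n+2h)\, dA$,

where $K(\Sigma, n)$ is defined by $w(A|\Sigma, n) = K(\Sigma, n)|A|^{\frac12(n-p-1)} e^{-\frac12 \mathrm{tr}\,\Sigma^{-1}A}$. Use Theorem 7.3.5 to show

$\mathcal{E}V^h = \dfrac{K(\Sigma_0, n)}{K(\Sigma_0, n+2h)} \prod_{i=1}^{q}\left\{\dfrac{K(\Sigma_{ii}, n+2h)}{K(\Sigma_{ii}, n)} \int\cdots\int w(A_{ii}|\Sigma_{ii}, n)\, dA_{ii}\right\}$.]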
9.2. (Sec. 9.3) Prove that if $p_1 = p_2 = p_3 = 1$ [Wilks (1935)]
Pr(V sv} =1,[4(n—1),4] +287 [41 —- 1), $]sin v1 - 0. 


[ Hint: Use Theorem 9.3.3 and Pr{V < v} = 1 — Pr{v< V}.] 


9.3. (Sec. 9.3) Prove that if $p_1 = p_2 = p_3 = 2$ [Wilks (1935)]
Pr{V <v} =1g(n-5,4) 


+Bo'(n~- 5,4) 08" ){n /6 ~ 3(n-1)¥o —3(n-4)v 
+(En- $)037 
—3(n—2)v log v— 3(n— 3)u"* log vo}. 
[ Hint: Use (9).] 


9.4. (Sec. 9.3) Derive some of the distributions obtained by Wilks (1935) and referred to at the end of Section 9.3.3. [Hint: In addition to the results for Problems 9.2 and 9.3, use those of Section 9.3.2.]


9.5. (Sec. 9.4) For the case $p_i = 2$, express $k$ and $\gamma_2$. Compute the second term of (6) when $v$ is chosen so that the first term is 0.95 for $p = 4$ and 6 and $N = 15$.


(Sec. 9.5) Prove that if BAB’ = CAC’ =] for A positive definite and B and C 
nonsingular then B= QC where Q is orthogonal. 


9.7. (Sec. 9.5) Prove $N$ times (2) has a limiting $\chi^2$-distribution with $f$ degrees of freedom under the null hypothesis.


9.8. (Sec. 9.8) Give the sample vector coefficient of alienation and the vector correlation coefficient.


9.9. (Sec. 9.8) If $y$ is the sample vector coefficient of alienation and $z$ the square of the vector correlation coefficient, find $\mathcal{E}y^g z^h$ when $\Sigma_{12} = 0$.


9.10. (Sec. 9.9) Prove


eo 


1 
foo fe —t ty ty, < = 
= oO (1 + EP uP) 


if p<n, [Hint: Let yj=wyl se ae ee bog 1,....p— 1, in turn] 


9.11. Let $x_1$ = arithmetic speed, $x_2$ = arithmetic power, $x_3$ = intellectual interest, $x_4$ = social interest, $x_5$ = activity interest. Kelley (1928) observed the following correlations between batteries of tests identified as above, based on 109 pupils:

     1.0000   0.4249  -0.0552  -0.0031   0.1927
     0.4249   1.0000  -0.0416   0.0495   0.0687
    -0.0552  -0.0416   1.0000   0.7474   0.1691
    -0.0031   0.0495   0.7474   1.0000   0.2653
     0.1927   0.0687   0.1691   0.2653   1.0000

Let $x^{(1)} = (x_1, x_2)'$ and $x^{(2)} = (x_3, x_4, x_5)'$. Test the hypothesis that $x^{(1)}$ is independent of $x^{(2)}$ at the 1% significance level.


9.12. Carry out the same exercise on the data in Problem 3.42.


9.13. Another set of time-study data [Abruzzi (1950)] is summarized by the correla- 
tion matrix based on 188 observations: 


     1.00  -0.27   0.06   0.07   0.02
    -0.27   1.00  -0.01  -0.02  -0.02
     0.06  -0.01   1.00  -0.07  -0.04
     0.07  -0.02  -0.07   1.00  -0.10
     0.02  -0.02  -0.04  -0.10   1.00

Test the hypothesis that $\sigma_{ij} = 0$, $i \ne j$, at the 5% significance level.


CHAPTER 10 


Testing Hypotheses of Equality 
of Covariance Matrices and 
Equality of Mean Vectors and 
Covariance Matrices 


10.1. INTRODUCTION 


In this chapter we study the problems of testing hypotheses of equality of 
covariance matrices and equality of both covariance matrices and mean 
vectors. In each case (except one) the problem and tests considered are 
multivariate generalizations of a univariate problem and test. Many of the 
tests are likelihood ratio tests or modifications of likelihood ratio tests. 
Invariance considerations lead to other test procedures. 

First, we consider equality of covariance matrices and equality of covari- 
ance matrices and mean vectors of several populations without specifying the 
common covariance matrix or the common covariance matrix and mean 
vector. The multivariate analysis of variance with random factors is consid- 
ered in this context. Later we treat the equality of a covariance matrix to a 
given matrix and also simultaneous equality of a covariance matrix to a given 
matrix and equality of a mean vector to a given vector. One other hypothesis 
considered, the equality of a covariance matrix to a given matrix except for a 
proportionality factor, has only a trivial corresponding univariate hypothesis. 

In each case the class of tests for a class of hypotheses leads to a 
confidence region. Families of simultaneous confidence intervals for covari- 
ances and for ratios of covariances are given. 


An Introduction to Multivariate Statistical Analysis, Third Edition. By T, W, Anderson 
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc. 


The application of the tests for elliptically contoured distributions is treated in Section 10.11.


10.2. CRITERIA FOR TESTING EQUALITY OF SEVERAL 
COVARIANCE MATRICES 


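In this section we study several normal distributions and consider using a set of samples, one from each population, to test the hypothesis that the covariance matrices of these populations are equal. Let $x_\alpha^{(g)}$, $\alpha = 1, \ldots, N_g$, $g = 1, \ldots, q$, be an observation from the $g$th population $N(\mu^{(g)}, \Sigma_g)$. We wish to test the hypothesis

(1)  $H_1: \Sigma_1 = \cdots = \Sigma_q$.

Let $\sum_{g=1}^{q} N_g = N$,

(2)  $A_g = \sum_{\alpha=1}^{N_g}(x_\alpha^{(g)} - \bar{x}^{(g)})(x_\alpha^{(g)} - \bar{x}^{(g)})', \quad g = 1, \ldots, q, \qquad A = \sum_{g=1}^{q} A_g$.

The likelihood function is

(3)  $L = \prod_{g=1}^{q}(2\pi)^{-\frac12 pN_g}|\Sigma_g|^{-\frac12 N_g} \exp\left[-\frac12\sum_{\alpha=1}^{N_g}(x_\alpha^{(g)} - \mu^{(g)})'\Sigma_g^{-1}(x_\alpha^{(g)} - \mu^{(g)})\right]$.

The space $\Omega$ is the parameter space in which each $\Sigma_g$ is positive definite and $\mu^{(g)}$ is any vector. The space $\omega$ is the parameter space in which $\Sigma_1 = \Sigma_2 = \cdots = \Sigma_q$ (positive definite) and $\mu^{(g)}$ is any vector. The maximum likelihood estimators of $\mu^{(g)}$ and $\Sigma_g$ in $\Omega$ are given by

(4)  $\hat{\mu}_\Omega^{(g)} = \bar{x}^{(g)}, \qquad \hat{\Sigma}_{g\Omega} = \frac{1}{N_g}A_g, \qquad g = 1, \ldots, q$.

The maximum likelihood estimators of $\mu^{(g)}$ in $\omega$ are given by (4), $\hat{\mu}_\omega^{(g)} = \bar{x}^{(g)}$, since the maximizing values of $\mu^{(g)}$ are the same regardless of $\Sigma_g$. The function to be maximized with respect to $\Sigma_1 = \cdots = \Sigma_q = \Sigma$, say, is

(5)  $(2\pi)^{-\frac12 pN}|\Sigma|^{-\frac12 N}\exp\left[-\frac12\,\mathrm{tr}\,\Sigma^{-1}A\right]$.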


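By Lemma 3.2.2, the maximizing value of $\Sigma$ is

(6)  $\hat{\Sigma}_\omega = \frac{1}{N}A$,

and the maximum of the likelihood function is

(7)  $(2\pi)^{-\frac12 pN}|\hat{\Sigma}_\omega|^{-\frac12 N}e^{-\frac12 pN}$.

The likelihood ratio criterion for testing (1) is

(8)  $\lambda_1 = \dfrac{\prod_{g=1}^{q}|\hat{\Sigma}_{g\Omega}|^{\frac12 N_g}}{|\hat{\Sigma}_\omega|^{\frac12 N}} = \dfrac{\prod_{g=1}^{q}|A_g|^{\frac12 N_g}}{|A|^{\frac12 N}} \cdot \dfrac{N^{\frac12 pN}}{\prod_{g=1}^{q}N_g^{\frac12 pN_g}}$.

The critical region is

(9)  $\lambda_1 \le \lambda_1(\varepsilon)$,

where $\lambda_1(\varepsilon)$ is defined so that (9) holds with probability $\varepsilon$ when (1) is true.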
Bartlett (1937a) has suggested modifying $\lambda_1$ in the univariate case by replacing sample numbers by the numbers of degrees of freedom of the $A_g$. Except for a numerical constant, the statistic he proposes is

(10)  $V_1 = \dfrac{\prod_{g=1}^{q}|A_g|^{\frac12 n_g}}{|A|^{\frac12 n}}$,

where $n_g = N_g - 1$ and $n = \sum_{g=1}^{q} n_g = N - q$. The numerator is proportional to a power of a weighted geometric mean of the sample generalized variances, and the denominator is proportional to a power of the determinant of a weighted arithmetic mean of the sample covariance matrices.
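The modified criterion (10) is most safely evaluated on the log scale, $\log V_1 = \frac12\sum_g n_g\log|A_g| - \frac12 n\log|A|$. The following is a minimal sketch using only the standard library; the helper names `logdet_pd` and `log_v1` and the numerical matrices are hypothetical, chosen for illustration, and the log-determinant is obtained from a Cholesky factorization on the assumption that each $A_g$ is positive definite.

```python
import math


def logdet_pd(a):
    """Log-determinant of a symmetric positive definite matrix (list of lists)
    computed from its Cholesky factor."""
    p = len(a)
    low = [[0.0] * p for _ in range(p)]
    total = 0.0
    for i in range(p):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(s)
                total += 2.0 * math.log(low[i][j])
            else:
                low[i][j] = s / low[j][j]
    return total


def log_v1(a_list, n_list):
    """log V_1 = (1/2) sum_g n_g log|A_g| - (1/2) n log|A|, with A = sum_g A_g."""
    p = len(a_list[0])
    a_sum = [[sum(a[i][j] for a in a_list) for j in range(p)] for i in range(p)]
    n = sum(n_list)
    numerator = 0.5 * sum(ng * logdet_pd(a) for a, ng in zip(a_list, n_list))
    return numerator - 0.5 * n * logdet_pd(a_sum)


# Illustrative (made-up) sums of squares and cross-products for q = 2, p = 2.
a1 = [[4.0, 1.0], [1.0, 3.0]]
a2 = [[2.0, 0.0], [0.0, 5.0]]
value = log_v1([a1, a2], [10, 12])
```

Multiplying every $A_g$ by the same positive constant leaves `log_v1` unchanged, which is a simple special case of the invariance of $V_1$ under a common linear transformation.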

In the scalar case ($p = 1$) of two samples the criterion (10) is

(11)  $V_1 = \dfrac{(n_1 s_1^2)^{\frac12 n_1}(n_2 s_2^2)^{\frac12 n_2}}{(n_1 s_1^2 + n_2 s_2^2)^{\frac12 n}} = \dfrac{n_1^{\frac12 n_1} n_2^{\frac12 n_2} F^{\frac12 n_1}}{(n_1 F + n_2)^{\frac12 n}}$,

where $s_1^2$ and $s_2^2$ are the usual unbiased estimators of $\sigma_1^2$ and $\sigma_2^2$ (the two population variances) and

(12)  $F = \dfrac{s_1^2}{s_2^2}$.

Thus the critical region

(13)  $V_1 \le V_1(\varepsilon)$

is based on the $F$-statistic with $n_1$ and $n_2$ degrees of freedom, and the inequality (13) implies a particular method of choosing $F_1(\varepsilon)$ and $F_2(\varepsilon)$ for the critical region

(14)  $F \le F_1(\varepsilon), \qquad F \ge F_2(\varepsilon)$.

Brown (1939) and Scheffé (1942) have shown that (14) yields an unbiased test.

Bartlett gave a more intuitive argument for the use of $V_1$ in place of $\lambda_1$. He argues that if $N_1$, say, is small, $A_1$ is given too much weight in $\lambda_1$, and other effects may be missed. Perlman (1980) has shown that the test based on $V_1$ is unbiased.

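If one assumes

(15)  $\mathcal{E}X_\alpha^{(g)} = \beta_g z_\alpha^{(g)}$,

where $z_\alpha^{(g)}$ consists of $k_g$ components, and if one estimates the matrix $\beta_g$, defining

(16)  $A_g = \sum_{\alpha=1}^{N_g}(x_\alpha^{(g)} - \hat{\beta}_g z_\alpha^{(g)})(x_\alpha^{(g)} - \hat{\beta}_g z_\alpha^{(g)})'$,

one uses (10) with $n_g = N_g - k_g$.

The statistical problem (parameter space $\Omega$ and null hypothesis $\omega$) is invariant with respect to changes of location within populations and a common linear transformation

(17)  $x^{*(g)} = Cx^{(g)} + \nu^{(g)}, \qquad g = 1, \ldots, q$,

where $C$ is nonsingular. Each matrix $A_g$ is invariant under change of location, and the modified criterion (10) is invariant:

(18)  $V_1^* = \dfrac{\prod_{g=1}^{q}|A_g^*|^{\frac12 n_g}}{|A^*|^{\frac12 n}} = \dfrac{\prod_{g=1}^{q}|CA_gC'|^{\frac12 n_g}}{|CAC'|^{\frac12 n}} = \dfrac{\prod_{g=1}^{q}|A_g|^{\frac12 n_g}}{|A|^{\frac12 n}} = V_1$.

Similarly, the likelihood ratio criterion (8) is invariant.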


An alternative invariant test procedure [Nagao (1973a)] is based on the criterion

(19)  $\frac12\sum_{g=1}^{q} n_g\,\mathrm{tr}\,(S_gS^{-1} - I)^2 = \frac12\sum_{g=1}^{q} n_g\,\mathrm{tr}\,(S_g - S)S^{-1}(S_g - S)S^{-1}$,

where $S_g = (1/n_g)A_g$ and $S = (1/n)A$. (See Section 7.8.)


10.3. CRITERIA FOR TESTING THAT SEVERAL NORMAL 
DISTRIBUTIONS ARE IDENTICAL 


In Section 8.8 we considered testing the equality of mean vectors when we assumed the covariance matrices were the same; that is, we tested

(1)  $H_2: \mu^{(1)} = \mu^{(2)} = \cdots = \mu^{(q)}$ given $\Sigma_1 = \Sigma_2 = \cdots = \Sigma_q$.

The test of the assumption in $H_2$ was considered in Section 10.2. Now let us consider the hypothesis that both means and covariances are the same; this is a combination of $H_1$ and $H_2$. We test

(2)  $H: \mu^{(1)} = \mu^{(2)} = \cdots = \mu^{(q)}, \qquad \Sigma_1 = \Sigma_2 = \cdots = \Sigma_q$.

As in Section 10.2, let $x_\alpha^{(g)}$, $\alpha = 1, \ldots, N_g$, be an observation from $N(\mu^{(g)}, \Sigma_g)$, $g = 1, \ldots, q$. Then $\Omega$ is the unrestricted parameter space of $\{\mu^{(g)}, \Sigma_g\}$, $g = 1, \ldots, q$, where $\Sigma_g$ is positive definite, and $\omega^*$ consists of the space restricted by (2).

The likelihood function is given by (3) of Section 10.2. The hypothesis $H_1$ of Section 10.2 is that the parameter point falls in $\omega$; the hypothesis $H_2$ of Section 8.8 is that the parameter point falls in $\omega^*$ given it falls in $\omega \supset \omega^*$; and the hypothesis $H$ here is that the parameter point falls in $\omega^*$ given that it is in $\Omega$.

We use the following lemma: 


Lemma 10.3.1. Let $y$ be an observation vector on a random vector with density $f(z, \theta)$, where $\theta$ is a parameter vector in a space $\Omega$. Let $H_a$ be the hypothesis $\theta \in \Omega_a \subset \Omega$, let $H_b$ be the hypothesis $\theta \in \Omega_b$, $\Omega_b \subset \Omega_a$, given $\theta \in \Omega_a$, and let $H_{ab}$ be the hypothesis $\theta \in \Omega_b$, given $\theta \in \Omega$. If $\lambda_a$, the likelihood ratio criterion for testing $H_a$, $\lambda_b$ for $H_b$, and $\lambda_{ab}$ for $H_{ab}$ are uniquely defined for the observation vector $y$, then

(3)  $\lambda_{ab} = \lambda_a\lambda_b$.

Proof. The lemma follows from the definitions:

(4)  $\lambda_a = \dfrac{\max_{\theta\in\Omega_a} f(y, \theta)}{\max_{\theta\in\Omega} f(y, \theta)}$,

(5)  $\lambda_b = \dfrac{\max_{\theta\in\Omega_b} f(y, \theta)}{\max_{\theta\in\Omega_a} f(y, \theta)}$,

(6)  $\lambda_{ab} = \dfrac{\max_{\theta\in\Omega_b} f(y, \theta)}{\max_{\theta\in\Omega} f(y, \theta)}$. ∎

Thus the likelihood ratio criterion for the hypothesis $H$ is the product of the likelihood ratio criteria for $H_1$ and $H_2$,


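(7)  $\lambda = \lambda_1\lambda_2 = \dfrac{\prod_{g=1}^{q}|A_g|^{\frac12 N_g}}{|B|^{\frac12 N}} \cdot \dfrac{N^{\frac12 pN}}{\prod_{g=1}^{q}N_g^{\frac12 pN_g}}$,

where

(8)  $B = \sum_{g=1}^{q}\sum_{\alpha=1}^{N_g}(x_\alpha^{(g)} - \bar{x})(x_\alpha^{(g)} - \bar{x})' = A + \sum_{g=1}^{q}N_g(\bar{x}^{(g)} - \bar{x})(\bar{x}^{(g)} - \bar{x})'$.

The critical region is defined by

(9)  $\lambda \le \lambda(\varepsilon)$,

where $\lambda(\varepsilon)$ is chosen so that the probability of (9) under $H$ is $\varepsilon$.

Let

(10)  $V_2 = \dfrac{|A|^{\frac12 n}}{|B|^{\frac12 n}}$;

this is equivalent to $\lambda_2$ for testing $H_2$, which is $\lambda$ of (12) of Section 8.8. We might consider

(11)  $V = V_1V_2 = \dfrac{\prod_{g=1}^{q}|A_g|^{\frac12 n_g}}{|B|^{\frac12 n}}$.

However, Perlman (1980) has shown that the likelihood ratio test is unbiased.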


10.4. DISTRIBUTIONS OF THE CRITERIA 


10.4.1. Characterization of the Distributions 


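First let us consider $V_1$ given by (10) of Section 10.2. If

(1)  $V_{1g} = \dfrac{|A_1 + \cdots + A_{g-1}|^{\frac12(n_1+\cdots+n_{g-1})}\,|A_g|^{\frac12 n_g}}{|A_1 + \cdots + A_g|^{\frac12(n_1+\cdots+n_g)}}, \qquad g = 2, \ldots, q$,

then

(2)  $V_1 = \prod_{g=2}^{q} V_{1g}$.

Theorem 10.4.1. $V_{12}, V_{13}, \ldots, V_{1q}$ defined by (1) are independent when $\Sigma_1 = \cdots = \Sigma_q$ and $n_g \ge p$, $g = 1, \ldots, q$.

The theorem is a consequence of the following lemma: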


Lemma 10.4.1. If $A$ and $B$ are independently distributed according to $W(\Sigma, m)$ and $W(\Sigma, n)$, respectively, $n \ge p$, $m \ge p$, and $C$ is such that $C(A+B)C' = I$, then $A+B$ and $CAC'$ are independently distributed, $A+B$ has the Wishart distribution with $m+n$ degrees of freedom, and $CAC'$ has the multivariate beta distribution with $n$ and $m$ degrees of freedom.

Proof of Lemma. The density of $D = A+B$ and $E = CAC'$ is found by replacing $A$ and $B$ in their joint density by $C^{-1}EC'^{-1}$ and $D - C^{-1}EC'^{-1} = C^{-1}(I-E)C'^{-1}$, respectively, and multiplying by the Jacobian, which is $\mathrm{mod}\,|C|^{-(p+1)} = |D|^{\frac12(p+1)}$, to obtain

(3)  $K(\Sigma, m)K(\Sigma, n)\,|C^{-1}EC'^{-1}|^{\frac12(m-p-1)}\,|C^{-1}(I-E)C'^{-1}|^{\frac12(n-p-1)}\,e^{-\frac12\mathrm{tr}\,\Sigma^{-1}D}\,|D|^{\frac12(p+1)}$
     $= K(\Sigma, m+n)\,|D|^{\frac12(m+n-p-1)}\,e^{-\frac12\mathrm{tr}\,\Sigma^{-1}D} \cdot \dfrac{\Gamma_p[\frac12(m+n)]}{\Gamma_p(\frac12 m)\Gamma_p(\frac12 n)}\,|E|^{\frac12(m-p-1)}\,|I-E|^{\frac12(n-p-1)}$

for $D$, $E$, and $I - E$ positive definite. ∎


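Proof of Theorem. If we let $A_1 + \cdots + A_g = D_g$ and $C_g(A_1 + \cdots + A_{g-1})C_g' = E_g$, where $C_gD_gC_g' = I$, $g = 2, \ldots, q$, then

(4)  $V_{1g} = \dfrac{|C_g^{-1}E_gC_g'^{-1}|^{\frac12(n_1+\cdots+n_{g-1})}\,|C_g^{-1}(I-E_g)C_g'^{-1}|^{\frac12 n_g}}{|C_g^{-1}C_g'^{-1}|^{\frac12(n_1+\cdots+n_g)}} = |E_g|^{\frac12(n_1+\cdots+n_{g-1})}\,|I-E_g|^{\frac12 n_g}, \qquad g = 2, \ldots, q$,

and $E_2, \ldots, E_q$ are independent by Lemma 10.4.1. ∎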


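We shall now find a characterization of the distribution of $V_{1g}$. A statistic $V_{1g}$ is of the form

(5)  $\dfrac{|B|^b|C|^c}{|B+C|^{b+c}}$.

Let $B_i$ and $C_i$ be the upper left-hand square submatrices of $B$ and $C$, respectively, of order $i$. Define $b_{(i)}$ and $c_{(i)}$ by

(6)  $B_i = \begin{pmatrix} B_{i-1} & b_{(i)} \\ b_{(i)}' & b_{ii} \end{pmatrix}, \qquad C_i = \begin{pmatrix} C_{i-1} & c_{(i)} \\ c_{(i)}' & c_{ii} \end{pmatrix}, \qquad i = 2, \ldots, p$.

Then (with $|B_0| = |C_0| = |B_0 + C_0| = 1$)

(7)  $\dfrac{|B|^b|C|^c}{|B+C|^{b+c}} = \prod_{i=1}^{p}\dfrac{|B_i|^b|C_i|^c}{|B_{i-1}|^b|C_{i-1}|^c} \cdot \dfrac{|B_{i-1}+C_{i-1}|^{b+c}}{|B_i+C_i|^{b+c}}$

     $= \prod_{i=1}^{p}\dfrac{(b_{ii} - b_{(i)}'B_{i-1}^{-1}b_{(i)})^b\,(c_{ii} - c_{(i)}'C_{i-1}^{-1}c_{(i)})^c}{[b_{ii} + c_{ii} - (b_{(i)}+c_{(i)})'(B_{i-1}+C_{i-1})^{-1}(b_{(i)}+c_{(i)})]^{b+c}}$

     $= \prod_{i=1}^{p}\left\{\dfrac{b_{ii\cdot i-1}^{\,b}\,c_{ii\cdot i-1}^{\,c}}{(b_{ii\cdot i-1}+c_{ii\cdot i-1})^{b+c}} \cdot \left[\dfrac{b_{ii\cdot i-1}+c_{ii\cdot i-1}}{b_{ii\cdot i-1}+c_{ii\cdot i-1} + b_{(i)}'B_{i-1}^{-1}b_{(i)} + c_{(i)}'C_{i-1}^{-1}c_{(i)} - (b_{(i)}+c_{(i)})'(B_{i-1}+C_{i-1})^{-1}(b_{(i)}+c_{(i)})}\right]^{b+c}\right\}$,

where $b_{ii\cdot i-1} = b_{ii} - b_{(i)}'B_{i-1}^{-1}b_{(i)}$ and $c_{ii\cdot i-1} = c_{ii} - c_{(i)}'C_{i-1}^{-1}c_{(i)}$. The second term for $i = 1$ is defined as 1.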

Now we want to argue that the ratios on the right-hand side of (7) are 
Statistically independent when B and C are independently distributed ac- 
cording to W(X, m) and W(X, n), respectively. It follows from Theorem 4.3.3 
that for B,_, fixed ), and b,.,_, are independently distributed according to 
NB, G;,.;-,B,_') and o;;,,_, x* with m-—(i—1) degrees of freedom, re- 
spectively. Lemma 10.4.1 implies that the first term (which is a function of 
bi,.;-1/Cji-¢-,) 18 independent of byj 4) + y.1—1- 

We apply the following lemma: 


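Lemma 10.4.2. For $B_{i-1}$ and $C_{i-1}$ positive definite

(8)  $b_{(i)}'B_{i-1}^{-1}b_{(i)} + c_{(i)}'C_{i-1}^{-1}c_{(i)} - (b_{(i)}+c_{(i)})'(B_{i-1}+C_{i-1})^{-1}(b_{(i)}+c_{(i)})$
     $= (B_{i-1}^{-1}b_{(i)} - C_{i-1}^{-1}c_{(i)})'(B_{i-1}^{-1}+C_{i-1}^{-1})^{-1}(B_{i-1}^{-1}b_{(i)} - C_{i-1}^{-1}c_{(i)})$.

Proof. Use of $(B^{-1}+C^{-1})^{-1} = [C^{-1}(B+C)B^{-1}]^{-1} = B(B+C)^{-1}C$ shows the left-hand side of (8) is (omitting $i$ and $i-1$)

(9)  $b'B^{-1}(B^{-1}+C^{-1})^{-1}(B^{-1}+C^{-1})b + c'(B^{-1}+C^{-1})(B^{-1}+C^{-1})^{-1}C^{-1}c - (b+c)'B^{-1}(B^{-1}+C^{-1})^{-1}C^{-1}(b+c)$
     $= b'B^{-1}(B^{-1}+C^{-1})^{-1}B^{-1}b + c'C^{-1}(B^{-1}+C^{-1})^{-1}C^{-1}c - b'B^{-1}(B^{-1}+C^{-1})^{-1}C^{-1}c - c'C^{-1}(B^{-1}+C^{-1})^{-1}B^{-1}b$,

which is the right-hand side of (8). ∎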


The denominator of the ith second term in (7) is the numerator plus (8). 
The conditional distribution of Bj'b,,—C,,e,) is normal with mean 


site 
Bo'.Bu,— C1) and covariance matrix o;,;_,(Bj') + C7). The covari- 
ance matrix is o;;.;_, times the inverse of the second matrix on the right-hand 


side of (8). Thus (8) is distributed as o;;.,_, x” with i — 1 degrees of freedom, 


independent of B,_,, C;_,, 5;,.;-,, amd ¢;;.;_1. 
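Then

(10)  $\dfrac{b_{ii\cdot i-1}^{\,b}\,c_{ii\cdot i-1}^{\,c}}{(b_{ii\cdot i-1}+c_{ii\cdot i-1})^{b+c}} = \left(\dfrac{b_{ii\cdot i-1}}{b_{ii\cdot i-1}+c_{ii\cdot i-1}}\right)^{b}\left(\dfrac{c_{ii\cdot i-1}}{b_{ii\cdot i-1}+c_{ii\cdot i-1}}\right)^{c}$

is distributed as $X_i^b(1-X_i)^c$, where $X_i$ has the $\beta[\frac12(m-i+1), \frac12(n-i+1)]$ distribution, $i = 1, \ldots, p$. Also

(11)  $\left[\dfrac{b_{ii\cdot i-1}+c_{ii\cdot i-1}}{b_{ii\cdot i-1}+c_{ii\cdot i-1} + (8)}\right]^{b+c}$

is distributed as $Y_i^{b+c}$, where $Y_i$ has the $\beta[\frac12(m+n)-i+1, \frac12(i-1)]$ distribution, $i = 2, \ldots, p$. Then (5) is distributed as $\prod_{i=1}^{p}X_i^b(1-X_i)^c\,\prod_{i=2}^{p}Y_i^{b+c}$, and the factors are mutually independent.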


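Theorem 10.4.2.

(12)  $V_1 = \prod_{g=2}^{q}\left\{\prod_{i=1}^{p}X_{ig}^{\frac12(n_1+\cdots+n_{g-1})}(1-X_{ig})^{\frac12 n_g}\,\prod_{i=2}^{p}Y_{ig}^{\frac12(n_1+\cdots+n_g)}\right\}$,

where the $X$'s and $Y$'s are independent, $X_{ig}$ has the $\beta[\frac12(n_1+\cdots+n_{g-1}-i+1), \frac12(n_g-i+1)]$ distribution, and $Y_{ig}$ has the $\beta[\frac12(n_1+\cdots+n_g)-i+1, \frac12(i-1)]$ distribution.

Proof. The factors $V_{12}, \ldots, V_{1q}$ are independent by Theorem 10.4.1. Each term $V_{1g}$ is decomposed according to (7), and the factors are independent. ∎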


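The factors of $V_1$ can be interpreted as test criteria for subhypotheses. The term depending on $X_{i2}$ is the criterion for testing the hypothesis that $\sigma^{(1)}_{ii\cdot i-1} = \sigma^{(2)}_{ii\cdot i-1}$, and the term depending on $Y_{i2}$ is the criterion for testing $\sigma^{(1)}_{(i)} = \sigma^{(2)}_{(i)}$ given $\sigma^{(1)}_{ii\cdot i-1} = \sigma^{(2)}_{ii\cdot i-1}$ and $\Sigma_{i-1,1} = \Sigma_{i-1,2}$. The terms depending on $X_{ig}$ and $Y_{ig}$ similarly furnish criteria for testing $\Sigma_1 = \Sigma_g$ given $\Sigma_1 = \cdots = \Sigma_{g-1}$.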

Now consider the likelihood ratio criterion $\lambda$ given by (7) of Section 10.3 for testing the hypothesis $\mu^{(1)} = \cdots = \mu^{(q)}$ and $\Sigma_1 = \cdots = \Sigma_q$. It is equivalent to the criterion

(13)  $W = \dfrac{\prod_{g=1}^{q}|A_g|^{\frac12 N_g}}{|A_1 + \cdots + A_q|^{\frac12(N_1+\cdots+N_q)}} \cdot \dfrac{|A_1 + \cdots + A_q|^{\frac12 N}}{|A_1 + \cdots + A_q + \sum_{g=1}^{q}N_g(\bar{x}^{(g)}-\bar{x})(\bar{x}^{(g)}-\bar{x})'|^{\frac12 N}}$.

The two factors of (13) are independent because the first factor is independent of $A_1 + \cdots + A_q$ (by Lemma 10.4.1 and the proof of Theorem 10.4.1) and of $\bar{x}^{(1)}, \ldots, \bar{x}^{(q)}$.


Theorem 10.4.3.

(14)  $W = \prod_{g=2}^{q}\left\{\prod_{i=1}^{p}X_{ig}^{\frac12(N_1+\cdots+N_{g-1})}(1-X_{ig})^{\frac12 N_g}\,\prod_{i=2}^{p}Y_{ig}^{\frac12(N_1+\cdots+N_g)}\right\}\prod_{i=1}^{p}Z_i^{\frac12 N}$,

where the $X$'s, $Y$'s, and $Z$'s are independent, $X_{ig}$ has the $\beta[\frac12(n_1+\cdots+n_{g-1}-i+1), \frac12(n_g-i+1)]$ distribution, $Y_{ig}$ has the $\beta[\frac12(n_1+\cdots+n_g)-i+1, \frac12(i-1)]$ distribution, and $Z_i$ has the $\beta[\frac12(n+1-i), \frac12(q-1)]$ distribution.

Proof. The characterization of the first factor in (13) corresponds to that of $V_1$ with the exponents of $X_{ig}$ and $1 - X_{ig}$ modified by replacing $n_g$ by $N_g$. The second term is $U_{p,q-1,n}^{\frac12 N}$, and its characterization follows from Theorem 8.4.1. ∎


10.4.2, Moments of the Distributions 


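We now find the moments of $V_1$ and of $W$. Since $0 \le V_1 \le 1$ and $0 \le W \le 1$, the moments determine the distributions uniquely. The $h$th moment of $V_1$ we find from the characterization of the distribution in Theorem 10.4.2:

(15)  $\mathcal{E}V_1^h = \prod_{g=2}^{q}\left\{\prod_{i=1}^{p}\mathcal{E}X_{ig}^{\frac12(n_1+\cdots+n_{g-1})h}(1-X_{ig})^{\frac12 n_gh}\,\prod_{i=2}^{p}\mathcal{E}Y_{ig}^{\frac12(n_1+\cdots+n_g)h}\right\}$

     $= \prod_{g=2}^{q}\left\{\prod_{i=1}^{p}\dfrac{\Gamma[\frac12(n_1+\cdots+n_{g-1})(1+h)-\frac12(i-1)]\,\Gamma[\frac12 n_g(1+h)-\frac12(i-1)]\,\Gamma[\frac12(n_1+\cdots+n_g)-i+1]}{\Gamma[\frac12(n_1+\cdots+n_{g-1}-i+1)]\,\Gamma[\frac12(n_g-i+1)]\,\Gamma[\frac12(n_1+\cdots+n_g)(1+h)-i+1]}\right.$
     $\left.\times \prod_{i=2}^{p}\dfrac{\Gamma[\frac12(n_1+\cdots+n_g)(1+h)-i+1]\,\Gamma[\frac12(n_1+\cdots+n_g-i+1)]}{\Gamma[\frac12(n_1+\cdots+n_g)-i+1]\,\Gamma[\frac12(n_1+\cdots+n_g)(1+h)-\frac12(i-1)]}\right\}$

     $= \prod_{i=1}^{p}\left\{\dfrac{\Gamma[\frac12(n+1-i)]}{\Gamma[\frac12(n+hn+1-i)]}\prod_{g=1}^{q}\dfrac{\Gamma[\frac12(n_g+hn_g+1-i)]}{\Gamma[\frac12(n_g+1-i)]}\right\}$

     $= \dfrac{\Gamma_p(\frac12 n)}{\Gamma_p[\frac12(n+hn)]}\prod_{g=1}^{q}\dfrac{\Gamma_p[\frac12(n_g+hn_g)]}{\Gamma_p(\frac12 n_g)}$.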


The $h$th moment of $W$ can be found from its representation in Theorem 10.4.3. We have


(16) 
q Pp ; 0 ; 
gw = I] Il EXE tr NG mc =, 5 a ial EY M+ ter +N JA &U gNh 


-a-tn 
g=21=1 r=2 ? 


n PED ([4(y + tng_, 1a) + $A(N, + +N_1)] 
> end feel [s(n eat ge | —i)|T[3(n, + 1 —i)| 
T[3(n, + 1-i+N,A)T[3(n, sates 8 eet es 1] 
[a(n to tn,) + gh(N, + +N) + 1-3] 
p wer sean | sh(N, ++ +N) +1 — i 


“ee T[3(n, + -- tn,) + 1-3] 


V[$(n, + oe +n, +1—i)| 


TEE(ay, te tn, +1—-i)+ 5h(N, +> +N, )] 


p P[i(n +1 i+ ANTE ND] 
ot TEg(nt1-)]T[E(N + AN -1) 


pf 4 T[3(N, +AN, ~i)} ra(N—i)] 


i I T[3(N,- ‘| r[3(N + AN —i)] 


T,(4n) q T,[3(n, + AN,)] | 


~ T(Gn+ any gi T, (ing) 


We summarize in the following theorem: 


Theorem 10.4.4. Let $V_1$ be the criterion defined by (10) of Section 10.2 for testing the hypothesis $H_1: \Sigma_1 = \cdots = \Sigma_q$, where $A_g$ is $n_g$ times the sample covariance matrix and $n_g + 1$ is the size of the sample from the $g$th population; let $W$ be the criterion defined by (13) for testing the hypothesis $H: \mu^{(1)} = \cdots = \mu^{(q)}$ and $H_1$, where $B = A + \sum_g N_g(\bar{x}^{(g)} - \bar{x})(\bar{x}^{(g)} - \bar{x})'$. The $h$th moment of $V_1$ when $H_1$ is true is given by (15). The $h$th moment of $W$, the criterion for testing $H$, is given by (16).


This theorem was first proved by Wilks (1932). See Problem 10.5 for an 
alternative approach. 
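The closed form of the $h$th moment of $V_1$ can be evaluated numerically on the log scale. The sketch below assumes the product-of-gammas form $\mathcal{E}V_1^h = \prod_{i=1}^{p}\{\Gamma[\frac12(n+1-i)]/\Gamma[\frac12(n+hn+1-i)]\ \prod_{g}\Gamma[\frac12(n_g+hn_g+1-i)]/\Gamma[\frac12(n_g+1-i)]\}$; the function name `log_moment_v1` is hypothetical.

```python
import math


def log_moment_v1(h, n_list, p):
    """log E[V_1^h] under H_1 from the product-of-gamma-functions form.

    h      : order of the moment (h >= 0)
    n_list : degrees of freedom n_g of the q samples (each n_g >= p)
    p      : dimension
    """
    n = sum(n_list)
    total = 0.0
    for i in range(1, p + 1):
        total += math.lgamma(0.5 * (n + 1 - i)) - math.lgamma(0.5 * (n + h * n + 1 - i))
        for ng in n_list:
            total += (math.lgamma(0.5 * (ng + h * ng + 1 - i))
                      - math.lgamma(0.5 * (ng + 1 - i)))
    return total


# Scalar check (p = 1, q = 2): V_1 = X^{n_1/2} (1 - X)^{n_2/2} with X ~ beta(n_1/2, n_2/2),
# so for n_1 = 4, n_2 = 6 the first moment is B(4, 6) / B(2, 3) = 1/42.
first_moment = math.exp(log_moment_v1(1, [4, 6], 1))
```

Working with `math.lgamma` rather than `math.gamma` avoids overflow when the degrees of freedom are large.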


If $p$ is even, say $p = 2r$, we can use the duplication formula for the gamma function [$\Gamma(\alpha+\frac12)\Gamma(\alpha+1) = \sqrt{\pi}\,\Gamma(2\alpha+1)2^{-2\alpha}$]. Then

(17)  $\mathcal{E}V_1^h = \prod_{j=1}^{r}\left\{\left[\prod_{g=1}^{q}\dfrac{\Gamma(n_g+hn_g+1-2j)}{\Gamma(n_g+1-2j)}\right]\dfrac{\Gamma(n+1-2j)}{\Gamma(n+hn+1-2j)}\right\}$

and

(18)  $\mathcal{E}W^h = \prod_{j=1}^{r}\left\{\left[\prod_{g=1}^{q}\dfrac{\Gamma(n_g+hN_g+1-2j)}{\Gamma(n_g+1-2j)}\right]\dfrac{\Gamma(N-2j)}{\Gamma(N+hN-2j)}\right\}$.

In principle the distributions of the factors can be integrated to obtain the distributions of $V_1$ and $W$. In Section 10.6 we consider $V_1$ when $p = 2$, $q = 2$ (the case of $p = 1$, $q = 2$ being a function of an $F$-statistic). In other cases, the integrals become unmanageable. To find probabilities we use the asymptotic expansion given in the next section. Box (1949) has given some other approximate distributions.


The characterizations of the distributions of the criteria in terms of independent factors suggest testing the hypotheses $H_1$ and $H$ by testing component hypotheses sequentially. First, we consider testing $H_1: \Sigma_1 = \Sigma_2$ for $q = 2$.


Let 
(8) (g) (g) (8) 
(19) xX X® = Koen p's) = Bom 3 (8) ey OG 
a= () ’ ‘pas ry 
xe) ute) of! of) 


The conditional distribution of X{% given Xi), = x{?),, ‘ie 
oo N[ wi + 08 B7(xPy — wy) os], 


where o,{8) ) = 0;{8) — of%)'{, of). It is assumed that the components of X 
have been numbered in ‘descending order of importance, At the ith step the 
component hypothesis o;{!)_, = 0,!?_, is tested at significance level e; by 
means of an F-test based on s{) _, /s_,; S, and S, are partitioned like = 
and &), If that hypothesis is accepted, then the hypothesis of? = 0 (or 
EfPr lo) ==2Pr'oP) is tested at significance level 6, on the assumption 
that {2 = 50, (a hypothesis pr2viously accepted). The criterion is 


IN-1 0 2-1 ,.(2) 1)~1 2)71 1-I,0 2)-1,(2 
(Sos) SOF sO Sa Serr)” (5; si) — SPR 's®) 


(21) GS Dsna 


where $(n_1+n_2-2i+2)\,s_{ii\cdot i-1} = (n_1-i+1)\,s^{(1)}_{ii\cdot i-1} + (n_2-i+1)\,s^{(2)}_{ii\cdot i-1}$. Under the null hypothesis (21) has the $F$-distribution with $i-1$ and $n_1+n_2-2i+2$ degrees of freedom. If this hypothesis is accepted, the $(i+1)$st step is taken. The overall hypothesis $\Sigma_1 = \Sigma_2$ is accepted if the $2p-1$ component hypotheses are accepted. (At the first step, $\sigma^{(g)}_{(1)}$ is vacuous.) The overall significance level is

(22)  $1 - \prod_{i=1}^{p}(1-\varepsilon_i)\prod_{i=2}^{p}(1-\delta_i)$.

If any component null hypothesis is rejected, the overall hypothesis is rejected.
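The overall level (22) is easy to evaluate for a chosen allocation of component levels. The following sketch uses a hypothetical helper `overall_level`; `eps` holds $\varepsilon_1, \ldots, \varepsilon_p$ (the variance-type tests) and `delta` holds $\delta_2, \ldots, \delta_p$ (the regression-type tests), and the numerical levels are illustrative only.

```python
import math


def overall_level(eps, delta):
    """Overall significance level 1 - prod(1 - eps_i) * prod(1 - delta_i)
    of the step-down procedure with 2p - 1 component tests."""
    if len(delta) != len(eps) - 1:
        raise ValueError("expected p values of eps and p - 1 values of delta")
    keep = math.prod(1.0 - e for e in eps) * math.prod(1.0 - d for d in delta)
    return 1.0 - keep


# p = 2: three component tests, each at the 1% level (illustrative choice).
level = overall_level([0.01, 0.01], [0.01])
```

Because the component levels multiply, equal levels of 1% on three tests give an overall level just under 3%, slightly less than the simple sum.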

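If $q > 2$, the null hypothesis $H_1: \Sigma_1 = \cdots = \Sigma_q$ is broken down into a sequence of hypotheses $[1/(g-1)](\Sigma_1 + \cdots + \Sigma_{g-1}) = \Sigma_g$ and tested sequentially. Each such matrix hypothesis is tested as $\Sigma_1 = \Sigma_2$ with $S_2$ replaced by $S_g$ and $S_1$ replaced by $[1/(n_1 + \cdots + n_{g-1})](A_1 + \cdots + A_{g-1})$.

In the case of the hypothesis $H$, consider first $q = 2$, $\Sigma_1 = \Sigma_2$, and $\mu^{(1)} = \mu^{(2)}$. One can test $\Sigma_1 = \Sigma_2$. The steps for testing $\mu^{(1)} = \mu^{(2)}$ consist of $t$-tests for $\mu_i^{(1)} = \mu_i^{(2)}$ based on the conditional distribution of $X_i^{(1)}$ and $X_i^{(2)}$ given $x_{(i-1)}^{(1)}$ and $x_{(i-1)}^{(2)}$. Alternatively one can test in sequence the equality of the conditional distributions of $X_i^{(1)}$ and $X_i^{(2)}$ given $x_{(i-1)}^{(1)}$ and $x_{(i-1)}^{(2)}$.

For $q > 2$, the hypothesis $\Sigma_1 = \cdots = \Sigma_q$ can be tested, and then $\mu^{(1)} = \cdots = \mu^{(q)}$. Alternatively, one can test $[1/(g-1)](\Sigma_1 + \cdots + \Sigma_{g-1}) = \Sigma_g$ and $[1/(g-1)](\mu^{(1)} + \cdots + \mu^{(g-1)}) = \mu^{(g)}$.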


10.5. ASYMPTOTIC EXPANSIONS OF THE DISTRIBUTIONS 
OF THE CRITERIA 


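Again we make use of Theorem 8.5.1 to obtain asymptotic expansions of the distributions of $V_1$ and of $\lambda$. We assume that $n_g = k_gn$, where $\sum_{g=1}^{q}k_g = 1$. The asymptotic expansion is in terms of $n$ increasing with $k_1, \ldots, k_q$ fixed. (We could assume only $\lim n_g/n = k_g > 0$.)

The $h$th moment of

(1)  $\lambda_1^* = V_1 \cdot \dfrac{n^{\frac12 pn}}{\prod_{g=1}^{q}n_g^{\frac12 pn_g}}$

is

(2)  $\mathcal{E}\lambda_1^{*h} = K\left[\dfrac{\prod_{j=1}^{p}(\frac12 n)^{\frac12 n}}{\prod_{g=1}^{q}\prod_{i=1}^{p}(\frac12 n_g)^{\frac12 n_g}}\right]^h \dfrac{\prod_{g=1}^{q}\prod_{i=1}^{p}\Gamma[\frac12 n_g(1+h)+\frac12(1-i)]}{\prod_{j=1}^{p}\Gamma[\frac12 n(1+h)+\frac12(1-j)]}$.

This is of the form of (1) of Section 8.5 with

(3)  $b = p$, $\quad y_j = \frac12 n$, $\quad \eta_j = \frac12(1-j)$, $\quad j = 1, \ldots, p$,
     $a = pq$, $\quad x_k = \frac12 n_g$, $\quad k = (g-1)p+1, \ldots, gp$, $\quad g = 1, \ldots, q$,
     $\xi_k = \frac12(1-i)$, $\quad k = i, p+i, \ldots, (q-1)p+i$, $\quad i = 1, \ldots, p$.

Then

(4)  $f = -2\left[\sum\xi_k - \sum\eta_j - \frac12(a-b)\right]$
     $= -\left[q\sum_{i=1}^{p}(1-i) - \sum_{j=1}^{p}(1-j) - (qp-p)\right]$
     $= -\left[-\frac12 qp(p-1) + \frac12 p(p-1) - (q-1)p\right]$
     $= \frac12(q-1)p(p+1)$,

$\varepsilon_j = \frac12(1-\rho)n$, $j = 1, \ldots, p$, and $\beta_k = \frac12(1-\rho)n_g = \frac12(1-\rho)k_gn$, $k = (g-1)p+1, \ldots, gp$.

In order to make the second term in the expansion vanish, we take $\rho$ as

(5)  $\rho = 1 - \left(\sum_{g=1}^{q}\dfrac{1}{n_g} - \dfrac{1}{n}\right)\dfrac{2p^2+3p-1}{6(p+1)(q-1)}$.

Then

(6)  $\omega_2 = \dfrac{p(p+1)\left[(p-1)(p+2)\left(\sum_{g=1}^{q}\dfrac{1}{n_g^2} - \dfrac{1}{n^2}\right) - 6(q-1)(1-\rho)^2\right]}{48\rho^2}$.

Thus

(7)  $\Pr\{-2\rho\log\lambda_1^* \le z\} = \Pr\{\chi_f^2 \le z\} + \omega_2\left[\Pr\{\chi_{f+4}^2 \le z\} - \Pr\{\chi_f^2 \le z\}\right] + O(n^{-3})$.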
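Let $\lambda = W N^{\frac12 pN}\prod_{g=1}^{q}N_g^{-\frac12 pN_g}$. The $h$th moment is

(8)  $\mathcal{E}\lambda^h = K\left[\dfrac{\prod_{j=1}^{p}(\frac12 N)^{\frac12 N}}{\prod_{g=1}^{q}\prod_{i=1}^{p}(\frac12 N_g)^{\frac12 N_g}}\right]^h \dfrac{\prod_{g=1}^{q}\prod_{i=1}^{p}\Gamma[\frac12 N_g(1+h)-\frac12 i]}{\prod_{j=1}^{p}\Gamma[\frac12 N(1+h)-\frac12 j]}$.

This is the form (1) of Section 8.5 with

(9)  $b = p$, $\quad y_j = \frac12 N = \frac12\sum_{g=1}^{q}N_g$, $\quad \eta_j = -\frac12 j$, $\quad j = 1, \ldots, p$,
     $a = pq$, $\quad x_k = \frac12 N_g$, $\quad k = (g-1)p+1, \ldots, gp$, $\quad g = 1, \ldots, q$,
     $\xi_k = -\frac12 i$, $\quad k = i, p+i, \ldots, (q-1)p+i$, $\quad i = 1, \ldots, p$.

The basic number of degrees of freedom is $f = \frac12 p(p+3)(q-1)$. We use (11) of Section 8.5 with $\beta_k = (1-\rho)x_k$ and $\varepsilon_j = (1-\rho)y_j$. To make $\omega_1 = 0$, we take

(10)  $\rho = 1 - \left(\sum_{g=1}^{q}\dfrac{1}{N_g} - \dfrac{1}{N}\right)\dfrac{2p^2+9p+11}{6(q-1)(p+3)}$.

Then

(11)  $\omega_2 = \dfrac{p(p+3)}{48\rho^2}\left[\left(\sum_{g=1}^{q}\dfrac{1}{N_g^2} - \dfrac{1}{N^2}\right)(p+1)(p+2) - 6(1-\rho)^2(q-1)\right]$.

The asymptotic expansion of the distribution of $-2\rho\log\lambda$ is

(12)  $\Pr\{-2\rho\log\lambda \le z\} = \Pr\{\chi_f^2 \le z\} + \omega_2\left[\Pr\{\chi_{f+4}^2 \le z\} - \Pr\{\chi_f^2 \le z\}\right] + O(n^{-3})$.

Box (1949) considered the case of $\lambda_1^*$ in considerable detail. In addition to this expansion he considered the use of (13) of Section 8.6. He also gave an $F$-approximation.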

As an example, we use one given by E. S. Pearson and Wilks (1933). The measurements are made on tensile strength ($X_1$) and hardness ($X_2$) of aluminum die castings. There are 12 observations in each of five samples. The observed sums of squares and cross-products in the five samples are

(13)  $A_1 = \begin{pmatrix} 78.948 & 214.18 \\ 214.18 & 1247.18 \end{pmatrix}$,
      $A_2 = \begin{pmatrix} 223.695 & 657.62 \\ 657.62 & 2519.31 \end{pmatrix}$,
      $A_3 = \begin{pmatrix} 57.448 & 190.63 \\ 190.63 & \ldots \end{pmatrix}$,
      $A_4 = \begin{pmatrix} 187.618 & 375.91 \\ 375.91 & 1473.44 \end{pmatrix}$,
      $A_5 = \begin{pmatrix} 88.456 & 259.18 \\ 259.18 & 1171.73 \end{pmatrix}$,

and the sum of these is

(14)  $A = \begin{pmatrix} 636.165 & 1697.52 \\ 1697.52 & \ldots \end{pmatrix}$.

The value of $-\log\lambda_1^*$ is 5.399. To use the asymptotic expansion we find $\rho = 152/165 = 0.9212$ and $\omega_2 = 0.0022$. Since $\omega_2$ is small, we can consider $-2\rho\log\lambda_1^*$ as $\chi^2$ with 12 degrees of freedom. Our observed criterion, therefore, is clearly not significant.
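The constants quoted in this example can be reproduced directly from the formulas for $f$, $\rho$, and $\omega_2$ given earlier in this section for the criterion based on $V_1$. The sketch below is illustrative; the function name `box_correction` is hypothetical, and the only inputs taken from the example are $p = 2$, five samples with $n_g = 11$ degrees of freedom each, and the reported value $-\log\lambda_1^* = 5.399$.

```python
def box_correction(p, n_list):
    """Return (f, rho, omega2) for the expansion of -2 rho log(lambda_1^*).

    f      = (q - 1) p (p + 1) / 2
    rho    = 1 - (sum 1/n_g - 1/n) (2p^2 + 3p - 1) / (6 (p + 1) (q - 1))
    omega2 = p (p + 1) [ (p - 1)(p + 2)(sum 1/n_g^2 - 1/n^2)
                         - 6 (q - 1)(1 - rho)^2 ] / (48 rho^2)
    """
    q = len(n_list)
    n = sum(n_list)
    f = (q - 1) * p * (p + 1) // 2
    s1 = sum(1.0 / ng for ng in n_list) - 1.0 / n
    s2 = sum(1.0 / ng ** 2 for ng in n_list) - 1.0 / n ** 2
    rho = 1.0 - s1 * (2 * p * p + 3 * p - 1) / (6.0 * (p + 1) * (q - 1))
    omega2 = (p * (p + 1)
              * ((p - 1) * (p + 2) * s2 - 6.0 * (q - 1) * (1.0 - rho) ** 2)
              / (48.0 * rho ** 2))
    return f, rho, omega2


# Five samples of 12 observations each, so n_g = 11; p = 2 measurements.
f, rho, omega2 = box_correction(p=2, n_list=[11] * 5)

# Observed criterion, using the reported -log(lambda_1^*) = 5.399.
statistic = 2.0 * rho * 5.399
```

With these inputs `f` is 12, `rho` equals 152/165, and `statistic` is a little under 10, well below the upper 5% point of $\chi^2$ with 12 degrees of freedom (about 21.03).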

Table B.5 [due to Korin (1969)] gives 5% significance points for $-2\log\lambda_1^*$ for $N_1 = \cdots = N_q$ for various $q$, small values of $N_g$, and $p = 2(1)6$.

The limiting distribution of the criterion (19) of Section 10.2 is also $\chi_f^2$. An asymptotic expansion of the distribution was given by Nagao (1973b) to terms of order $1/n$ involving $\chi^2$-distributions with $f$, $f+2$, $f+4$, and $f+6$ degrees of freedom.


10.6. THE CASE OF TWO POPULATIONS 


10.6.1, Invariant Tests 


When $q = 2$, the null hypothesis $H_1$ is $\Sigma_1 = \Sigma_2$. It is invariant with respect to transformations

(1)  $x^{*(1)} = Cx^{(1)} + \nu^{(1)}, \qquad x^{*(2)} = Cx^{(2)} + \nu^{(2)}$,

where $C$ is nonsingular. The maximal invariant of the parameters under the transformation of locations ($C = I$) is the pair of covariance matrices $\Sigma_1, \Sigma_2$, and the maximal invariant of the sufficient statistics $\bar{x}^{(1)}, S_1, \bar{x}^{(2)}, S_2$ is the pair of matrices $S_1, S_2$ (or equivalently $A_1, A_2$). The transformation (1) induces the transformations $\Sigma_1^* = C\Sigma_1C'$, $\Sigma_2^* = C\Sigma_2C'$, $S_1^* = CS_1C'$, and $S_2^* = CS_2C'$. The roots $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ of

(2)  $|\Sigma_1 - \lambda\Sigma_2| = 0$

are invariant under these transformations since

(3)  $|\Sigma_1^* - \lambda\Sigma_2^*| = |C\Sigma_1C' - \lambda C\Sigma_2C'| = |CC'| \cdot |\Sigma_1 - \lambda\Sigma_2|$.

Moreover, the roots are the only invariants because there exists a nonsingular matrix $C$ such that

(4)  $C\Sigma_1C' = \Lambda, \qquad C\Sigma_2C' = I$,

where $\Lambda$ is the diagonal matrix with $\lambda_i$ as the $i$th diagonal element, $i = 1, \ldots, p$. (See Theorem A.2.2 of the Appendix.) Similarly, the maximal invariants of $S_1$ and $S_2$ are the roots $l_1 \ge l_2 \ge \cdots \ge l_p$ of

(5)  $|S_1 - lS_2| = 0$.

Theorem 10.6.1. The maximal invariant of the parameters of $N(\mu^{(1)}, \Sigma_1)$ and $N(\mu^{(2)}, \Sigma_2)$ under the transformation (1) is the set of roots $\lambda_1 \ge \cdots \ge \lambda_p$ of (2). The maximal invariant of the sufficient statistics $\bar{x}^{(1)}, S_1, \bar{x}^{(2)}, S_2$ is the set of roots $l_1 \ge \cdots \ge l_p$ of (5).
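The invariance of the roots of $|S_1 - lS_2| = 0$ under $S_g \to CS_gC'$ can be checked numerically. The sketch below treats only the case $p = 2$, where the determinantal equation is a quadratic in $l$, and uses the standard library alone; the helper names `roots_2x2` and `congruence` and the matrices are hypothetical, chosen for illustration.

```python
import math


def roots_2x2(s1, s2):
    """Roots l_1 >= l_2 of |S1 - l S2| = 0 for symmetric 2x2 matrices
    (S2 positive definite).  Expanding the determinant gives
    |S2| l^2 - (s1_11 s2_22 + s1_22 s2_11 - 2 s1_12 s2_12) l + |S1| = 0."""
    a = s2[0][0] * s2[1][1] - s2[0][1] ** 2
    b = s1[0][0] * s2[1][1] + s1[1][1] * s2[0][0] - 2.0 * s1[0][1] * s2[0][1]
    c = s1[0][0] * s1[1][1] - s1[0][1] ** 2
    disc = math.sqrt(max(b * b - 4.0 * a * c, 0.0))
    return ((b + disc) / (2.0 * a), (b - disc) / (2.0 * a))


def congruence(c, s):
    """Return C S C' for 2x2 matrices given as lists of lists."""
    cs = [[sum(c[i][k] * s[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
    return [[sum(cs[i][k] * c[j][k] for k in range(2)) for j in range(2)] for i in range(2)]


# Illustrative (made-up) sample covariance matrices and a nonsingular C.
s1 = [[4.0, 1.0], [1.0, 3.0]]
s2 = [[2.0, 0.5], [0.5, 1.0]]
c = [[2.0, 1.0], [0.0, 1.0]]

roots_before = roots_2x2(s1, s2)
roots_after = roots_2x2(congruence(c, s1), congruence(c, s2))
```

`roots_before` and `roots_after` agree up to rounding, as (3) requires. For general $p$ one would solve the generalized symmetric eigenvalue problem with a numerical linear algebra library instead.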


Any invariant test criterion can be expressed in terms of the roots $l_1, \ldots, l_p$. The criterion $V_1$ is $n_1^{\frac12 pn_1}n_2^{\frac12 pn_2}$ times

(6)  $\dfrac{|S_1|^{\frac12 n_1}|S_2|^{\frac12 n_2}}{|n_1S_1 + n_2S_2|^{\frac12 n}} = \dfrac{|L|^{\frac12 n_1}|I|^{\frac12 n_2}}{|n_1L + n_2I|^{\frac12 n}} = \prod_{i=1}^{p}\dfrac{l_i^{\frac12 n_1}}{(n_1l_i + n_2)^{\frac12 n}}$,

where $L$ is the diagonal matrix with $l_i$ as the $i$th diagonal element. The null hypothesis is rejected if the smaller roots are too small or if the larger roots are too large, or both.

The null hypothesis is that $\lambda_1 = \cdots = \lambda_p = 1$. Any useful invariant test of the null hypothesis has a rejection region in the space of $l_1, \ldots, l_p$ that includes the points that in some sense are far from $l_1 = \cdots = l_p = 1$. The power of an invariant test depends on the parameters through the roots $\lambda_1, \ldots, \lambda_p$.

The criterion (19) of Section 10.2 is (with $nS = n_1S_1 + n_2S_2$)

(7)  $\frac12 n_1\,\mathrm{tr}\,[(S_1 - S)S^{-1}]^2 + \frac12 n_2\,\mathrm{tr}\,[(S_2 - S)S^{-1}]^2$
     $= \frac12 n_1\,\mathrm{tr}\,[C(S_1 - S)C'(CSC')^{-1}]^2 + \frac12 n_2\,\mathrm{tr}\,[C(S_2 - S)C'(CSC')^{-1}]^2$
     $= \frac12 n_1\,\mathrm{tr}\left\{L\left(\dfrac{n_1}{n}L + \dfrac{n_2}{n}I\right)^{-1} - I\right\}^2 + \frac12 n_2\,\mathrm{tr}\left\{\left(\dfrac{n_1}{n}L + \dfrac{n_2}{n}I\right)^{-1} - I\right\}^2$
     $= \frac12 n\,n_1n_2\sum_{i=1}^{p}\dfrac{(l_i - 1)^2}{(n_1l_i + n_2)^2}$.

This criterion is a measure of how close $l_1, \ldots, l_p$ are to 1; the hypothesis is rejected if the measure is too large. Under the null hypothesis, (7) has the $\chi^2$-distribution with $f = \frac12 p(p+1)$ degrees of freedom as $n_1 \to \infty$, $n_2 \to \infty$, and $n_1/n_2$ approaches a positive constant. Nagao (1973b) gives an asymptotic expansion of this distribution to terms of order $1/n$.
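For two populations Nagao's criterion depends on the data only through the roots $l_1, \ldots, l_p$, in the form $\frac12 n\,n_1n_2\sum_i (l_i - 1)^2/(n_1l_i + n_2)^2$ with $n = n_1 + n_2$. A minimal sketch follows; the function name `nagao_two_sample` and the numerical roots are hypothetical.

```python
def nagao_two_sample(roots, n1, n2):
    """Two-sample Nagao criterion written in terms of the roots of |S1 - l S2| = 0:
    (1/2) n n1 n2 sum_i (l_i - 1)^2 / (n1 l_i + n2)^2, with n = n1 + n2."""
    n = n1 + n2
    return 0.5 * n * n1 * n2 * sum((l - 1.0) ** 2 / (n1 * l + n2) ** 2 for l in roots)


# Roots all equal to 1 give the value 0 (sample matrices identical).
at_null = nagao_two_sample([1.0, 1.0], 10, 10)

# One root twice as large as under equality, n1 = n2 = 10 (illustrative).
value = nagao_two_sample([2.0], 10, 10)
```

The value is compared with a percentage point of $\chi^2$ with $\frac12 p(p+1)$ degrees of freedom, and large values lead to rejection.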

Roy (1953) suggested a test based on the largest and smallest roots, /, and 
[,, The procedure is to reject the null hypothesis if 7, >, or if i,<k,, 
where k, and k, are chosen so that the probability of rejection when A=J 
is the desired significance level. Roy (1957) proposed determining x, and k,, 
so that the test is locally unbiased, that is, that the power functions have a 
telative minimum at A =I. Since it is hard to determine k, and k, on this 
basis, other proposals have been made. The linit k, can be determined so 
that Pr{/,>,|H,) is one-half the significance level, or Prii, <k,|Hy} is 
one-half of the significance level, or k, +k, = 2, or k,k, = 1. In principle k, 
and k,, can be determined from the distribution of the roots, given in Section 
13.2. Schuurmann, Waikar, and Krishnaiah (1975) and Chu and Pillai (1979) 
give some exact values of k, and k, for small values of p. Chu and Pillai 
(1979) also make some power comparisons of several test procedures, 

In the case of $p = 1$ the only invariant of the sufficient statistics is $S_1/S_2$,
which is the usual $F$-statistic with $n_1$ and $n_2$ degrees of freedom. The
criterion $V_1$ is $(A_1/A_2)^{\frac12 n_1}[1 + (A_1/A_2)]^{-\frac12 n}$; the critical region $V_1$ less than a
constant is equivalent to a two-tailed critical region for the $F$-statistic. The
quantity $n(B-A)/A$ has an independent $F$-distribution with 1 and $n$ degrees
of freedom. (See Section 10.3.)

In the case of $p = 2$, the $h$th moment of $V_1$ is, from (15) of Section 10.4,

$$
\mathcal{E}V_1^h = \frac{\Gamma(n_1+hn_1-1)\,\Gamma(n_2+hn_2-1)\,\Gamma(n-1)}{\Gamma(n_1-1)\,\Gamma(n_2-1)\,\Gamma(n+hn-1)}
= \mathcal{E}\big[X_1^{n_1}(1-X_1)^{n_2}X_2^{n_1+n_2}\big]^h,
\tag{8}
$$

where $X_1$ and $X_2$ are independently distributed according to $\beta(x \mid n_1-1,
n_2-1)$ and $\beta(x \mid n_1+n_2-2, 1)$, respectively. Then $\Pr\{V_1 \le v\}$ can be found by
integration. (See Problems 10.8 and 10.9.)

Anderson (1965a) has shown that a confidence interval for $a'\Sigma_1a/a'\Sigma_2a$
for all $a$ with confidence coefficient $\varepsilon$ is given by $(l_p/U,\ l_1/L)$, where
Pr{(n, ~?P ay WL Sha P aja per Pr{(n, a eg 1) Fs p41, ny <nU} se Le 


10.6.2. Components of Variance


In Section 8.8 we considered what is equivalent to the one-way analysis of
variance with fixed effects. We can write the model in the balanced case
($N_1 = N_2 = \cdots = N_q = M$) as

$$
X_\alpha^{(g)} = \mu^{(g)} + U_\alpha^{(g)} = \mu + \nu_g + U_\alpha^{(g)}, \qquad \alpha = 1,\ldots,M,\quad g = 1,\ldots,q,
\tag{9}
$$

where $\mathcal{E}U_\alpha^{(g)} = 0$ and $\mathcal{E}U_\alpha^{(g)}U_\alpha^{(g)\prime} = \Sigma$, $\nu_g = \mu^{(g)} - \mu$, and $\mu = (1/q)\sum_{g=1}^q\mu^{(g)}$
($\sum_{g=1}^q\nu_g = 0$). The null hypothesis of no effect is $\nu_1 = \cdots = \nu_q = 0$. Let
$\bar x^{(g)} = (1/M)\sum_{\alpha=1}^M x_\alpha^{(g)}$ and $\bar x = (1/q)\sum_{g=1}^q\bar x^{(g)}$. The analysis of variance table
is


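Source   Sum of Squares                                                                                        Degrees of Freedom
Effect   $H = M\sum_{g=1}^q(\bar x^{(g)} - \bar x)(\bar x^{(g)} - \bar x)'$                                       $q - 1$
Error    $G = \sum_{g=1}^q\sum_{\alpha=1}^M(x_\alpha^{(g)} - \bar x^{(g)})(x_\alpha^{(g)} - \bar x^{(g)})'$       $q(M - 1)$
Total    $\sum_{g=1}^q\sum_{\alpha=1}^M(x_\alpha^{(g)} - \bar x)(x_\alpha^{(g)} - \bar x)'$                       $qM - 1$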


Invariant tests of the null hypothesis of no effect are based on the roots of
$|H - mG| = 0$ or of $|S_h - lS_e| = 0$, where $S_h = [1/(q-1)]H$ and $S_e =
[1/q(M-1)]G$. The null hypothesis is rejected if one or more of the roots is
too large. The error matrix $G$ has the distribution $W(\Sigma, q(M-1))$. The
effects matrix $H$ has the distribution $W(\Sigma, q-1)$ when the null hypothesis is
true and has the noncentral Wishart distribution when the null hypothesis is
not true; its expected value is

$$
\mathcal{E}H = (q-1)\Sigma + M\sum_{g=1}^q(\mu^{(g)} - \mu)(\mu^{(g)} - \mu)'
= (q-1)\Sigma + M\sum_{g=1}^q\nu_g\nu_g'.
\tag{10}
$$


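The MANOVA model with random effects is

$$
X_\alpha^{(g)} = \mu + V_g + U_\alpha^{(g)}, \qquad \alpha = 1,\ldots,M,\quad g = 1,\ldots,q,
\tag{11}
$$

where $V_g$ has the distribution $N(0,\Theta)$. Then $X_\alpha^{(g)}$ has the distribution
$N(\mu, \Sigma + \Theta)$. The null hypothesis of no effect is

$$
\Theta = 0.
\tag{12}
$$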


In this model $G$ again has the distribution $W(\Sigma, q(M-1))$. Since $\bar X^{(g)} = \mu +
V_g + \bar U^{(g)}$ has the distribution $N(\mu, (1/M)\Sigma + \Theta)$, $H$ has the distribution
$W(\Sigma + M\Theta, q-1)$. The null hypothesis (12) is equivalent to the equality of
the covariance matrices in these two Wishart distributions; that is, $\Sigma = \Sigma +
M\Theta$. The matrices $G$ and $H$ correspond to $A_1$ and $A_2$ in Section 10.6.1.
However, here the alternative to the null hypothesis is that $(\Sigma + M\Theta) - \Sigma$ is
positive semidefinite, rather than $\Sigma_1 \ne \Sigma_2$. The null hypothesis is to be
rejected if $H$ is too large relative to $G$. Any of the criteria presented in
Section 10.2 can be used to test the null hypothesis here, and its distribution
under the null hypothesis is the same as given there.
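
As an illustration of the quantities in the analysis of variance table, the following sketch computes the effect matrix $H$, the error matrix $G$, and the roots of the determinantal equation formed from the two mean-square matrices. It assumes balanced data stored as an array of shape $(q, M, p)$; the function names, the names `S_h` and `S_e` for the mean squares, and the simulated data are illustrative assumptions and are not taken from the text.

```python
import numpy as np

def manova_matrices(x):
    """Return (H, G) for balanced data x of shape (q, M, p)."""
    q, M, p = x.shape
    group_means = x.mean(axis=1)             # q group mean vectors
    grand_mean = group_means.mean(axis=0)    # mean of the group means
    d = group_means - grand_mean
    H = M * d.T @ d                          # effect sum of squares and products
    e = (x - group_means[:, None, :]).reshape(q * M, p)
    G = e.T @ e                              # error sum of squares and products
    return H, G

def mean_square_roots(H, G, q, M):
    """Roots of |S_h - l S_e| = 0, largest first (hypothetical helper)."""
    S_h = H / (q - 1)
    S_e = G / (q * (M - 1))
    roots = np.linalg.eigvals(np.linalg.solve(S_e, S_h))
    return np.sort(roots.real)[::-1]

# Simulated data with no effect and identity covariance (an assumption).
rng = np.random.default_rng(0)
q, M, p = 6, 10, 3
x = rng.standard_normal((q, M, p))
H, G = manova_matrices(x)
roots = mean_square_roots(H, G, q, M)
print(roots)
```

In the balanced case $H + G$ reproduces the total sum of squares and products, which gives a quick check on the decomposition.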

The likelihood ratio criterion for testing $\Theta = 0$ must take into account the
fact that $\Theta$ is positive semidefinite; that is, the maximum likelihood estimators
of $\Sigma$ and $\Sigma + M\Theta$ under $\Omega$ must be such that the estimator of $\Theta$ is
positive semidefinite. Let $l_1 > l_2 > \cdots > l_p$ be the roots of

$$
\Big|H - l\,\frac{1}{M-1}\,G\Big| = 0.
\tag{13}
$$

(Note $\{1/[q(M-1)]\}G$ and $(1/q)H$ maximize the likelihood without regard
to $\Theta$ being positive definite.) Let $l_i^* = l_i$ if $l_i > 1$, and let $l_i^* = 1$ if $l_i \le 1$.
Then the likelihood ratio criterion for testing the hypothesis $\Theta = 0$ against
the alternative $\Theta$ positive semidefinite and $\Theta \ne 0$ is

$$
M^{\frac12 qMp}\prod_{i=1}^p\frac{l_i^{*\,\frac12 q}}{(l_i^* + M - 1)^{\frac12 qM}}
= M^{\frac12 qMk}\prod_{i=1}^k\frac{l_i^{\frac12 q}}{(l_i + M - 1)^{\frac12 qM}},
\tag{14}
$$

where $k$ is the number of roots of (13) greater than 1. [See Anderson (1946b),
(1984a), (1989a), Morris and Olkin (1964), and Klotz and Putter (1969).]
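
The truncation of the roots at 1 can be made concrete with a small sketch. It evaluates criterion (14) on the log scale from a list of positive roots, assuming those roots compare $(1/q)H$ with $\{1/[q(M-1)]\}G$ as in the note above; the function name and the numerical roots are hypothetical.

```python
import math

def lr_components_of_variance(roots, q, M):
    """Criterion (14) from positive roots; roots at or below 1 are replaced by 1."""
    log_lam = 0.0
    for l in roots:
        l_star = max(l, 1.0)                          # truncated root l_i^*
        log_lam += (0.5 * q * M * math.log(M)
                    + 0.5 * q * math.log(l_star)
                    - 0.5 * q * M * math.log(l_star + M - 1))
    return math.exp(log_lam)

# Only the first root exceeds 1, so only it contributes (k = 1).
print(lr_components_of_variance([3.0, 0.5], q=4, M=5))
```

A root with $l_i \le 1$ contributes a factor of exactly 1, which is why the product over all $p$ truncated roots equals the product over the $k$ roots greater than 1.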


10.7. TESTING THE HYPOTHESIS THAT A COVARIANCE MATRIX IS
PROPORTIONAL TO A GIVEN MATRIX; THE SPHERICITY TEST 


10.7.1. The Hypothesis 


In many statistical analyses that are considered univariate, the assumption is
made that a set of random variables are independent and have a common
variance. In this section we consider a test of these assumptions based on
repeated sets of observations.

More precisely, we use a sample of $p$-component vectors $x_1,\ldots,x_N$ from
$N(\mu, \Sigma)$ to test the hypothesis $H$: $\Sigma = \sigma^2I$, where $\sigma^2$ is not specified. The
hypothesis can be given an algebraic interpretation in terms of the characteristic
roots of $\Sigma$, that is, the roots of

$$
|\Sigma - \phi I| = 0.
\tag{1}
$$


The hypothesis is true if and only if all the roots of (1) are equal.¹ Another
way of putting it is that the arithmetic mean of roots $\phi_1,\ldots,\phi_p$ is equal to
the geometric mean, that is,

$$
\frac{\prod_{i=1}^p\phi_i^{1/p}}{\sum_{i=1}^p\phi_i/p} = \frac{|\Sigma|^{1/p}}{\operatorname{tr}\Sigma/p} = 1.
\tag{2}
$$

The lengths squared of the principal axes of the ellipsoids of constant density
are proportional to the roots $\phi_i$ (see Chapter 11); the hypothesis specifies
that these are equal, that is, that the ellipsoids are spheres.

The hypothesis $H$ is equivalent to the more general form $\Psi = \sigma^2\Psi_0$, with
$\Psi_0$ specified, having observation vectors $y_1,\ldots,y_N$ from $N(\nu, \Psi)$. Let $C$ be
a matrix such that

$$
C\Psi_0C' = I,
\tag{3}
$$

and let $\mu^* = C\nu$, $\Sigma^* = C\Psi C'$, $x_\alpha^* = Cy_\alpha$. Then $x_1^*,\ldots,x_N^*$ are observations
from $N(\mu^*, \Sigma^*)$, and the hypothesis is transformed into $H$: $\Sigma^* = \sigma^2I$.


10.7.2. The Criterion 


In the canonical form the hypothesis $H$ is a combination of the hypothesis
$H_1$: $\Sigma$ is diagonal or the components of $X$ are independent and $H_2$: the
diagonal elements of $\Sigma$ are equal given that $\Sigma$ is diagonal or the variances of
the components of $X$ are equal given that the components are independent.
Thus by Lemma 10.3.1 the likelihood ratio criterion $\lambda$ for $H$ is the product of
the criterion $\lambda_1$ for $H_1$ and $\lambda_2$ for $H_2$. From Section 9.2 we see that the
criterion for $H_1$ is

$$
\lambda_1 = \frac{|A|^{\frac12 N}}{\prod_{i=1}^p a_{ii}^{\frac12 N}} = |r_{ij}|^{\frac12 N},
\tag{4}
$$

where

$$
A = \sum_{\alpha=1}^N(x_\alpha - \bar x)(x_\alpha - \bar x)' = (a_{ij})
\tag{5}
$$

and $r_{ij} = a_{ij}/\sqrt{a_{ii}a_{jj}}$. We use the results of Section 10.2 to obtain $\lambda_2$ by
considering the $i$th component of $x_\alpha$ as the $\alpha$th observation from the $i$th
population. ($p$ here is $q$ in Section 10.2; $N$ here is $N_g$ there; $pN$ here is $N$


¹This follows from the fact that $\Sigma = O'\Phi O$, where $\Phi$ is a diagonal matrix with roots as diagonal
elements and $O$ is an orthogonal matrix.


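there.) Thus

$$
\lambda_2 = \frac{\prod_{i=1}^p\big[\sum_{\alpha}(x_{i\alpha} - \bar x_i)^2\big]^{\frac12 N}}{\big[\sum_{i=1}^p\sum_{\alpha}(x_{i\alpha} - \bar x_i)^2/p\big]^{\frac12 pN}}
= \frac{\prod_{i=1}^p a_{ii}^{\frac12 N}}{(\operatorname{tr}A/p)^{\frac12 pN}}.
\tag{6}
$$

Thus the criterion for $H$ is

$$
\lambda = \lambda_1\lambda_2 = \frac{|A|^{\frac12 N}}{(\operatorname{tr}A/p)^{\frac12 pN}}.
\tag{7}
$$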

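It will be observed that $\lambda$ resembles (2). If $l_1,\ldots,l_p$ are the roots of

$$
|S - lI| = 0,
\tag{8}
$$

where $S = (1/n)A$, the criterion is a power of the ratio of the geometric
mean to the arithmetic mean,

$$
\lambda = \left(\frac{\prod_{i=1}^p l_i^{1/p}}{\sum_{i=1}^p l_i/p}\right)^{\frac12 pN}.
\tag{9}
$$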

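Now let us go back to the hypothesis $\Psi = \sigma^2\Psi_0$, given observation
vectors $y_1,\ldots,y_N$ from $N(\nu,\Psi)$. In the transformed variables $\{x_\alpha^*\}$ the
criterion is $|A^*|^{\frac12 N}(\operatorname{tr}A^*/p)^{-\frac12 pN}$, where

$$
A^* = \sum_{\alpha=1}^N(x_\alpha^* - \bar x^*)(x_\alpha^* - \bar x^*)' = CBC',
\tag{10}
$$

where

$$
B = \sum_{\alpha=1}^N(y_\alpha - \bar y)(y_\alpha - \bar y)'.
\tag{11}
$$

From (3) we have $\Psi_0 = C^{-1}(C')^{-1} = (C'C)^{-1}$. Thus

$$
\begin{aligned}
|A^*| &= |CBC'| = |B|\cdot|C'C| = \frac{|B|}{|\Psi_0|} = |B\Psi_0^{-1}|, \\
\operatorname{tr}A^* &= \operatorname{tr}CBC' = \operatorname{tr}BC'C = \operatorname{tr}B\Psi_0^{-1}.
\end{aligned}
\tag{12}
$$

The results can be summarized.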
Theorem 10.7.1. Given a set of $p$-component observation vectors $y_1,\ldots,y_N$
from $N(\nu,\Psi)$, the likelihood ratio criterion for testing the hypothesis $H$: $\Psi =
\sigma^2\Psi_0$, where $\Psi_0$ is specified and $\sigma^2$ is not specified, is

$$
\frac{|B\Psi_0^{-1}|^{\frac12 N}}{(\operatorname{tr}B\Psi_0^{-1}/p)^{\frac12 pN}}.
\tag{13}
$$

Mauchly (1940) gave this criterion and its moments under the null
hypothesis.

The maximum likelihood estimator of $\sigma^2$ under the null hypothesis is
$\operatorname{tr}B\Psi_0^{-1}/(pN)$, which is $\operatorname{tr}A/(pN)$ in canonical form; an unbiased estimator
is $\operatorname{tr}B\Psi_0^{-1}/[p(N-1)]$ or $\operatorname{tr}A/[p(N-1)]$ in canonical form [Hotelling
(1951)]. Then $\operatorname{tr}B\Psi_0^{-1}/\sigma^2$ has the $\chi^2$-distribution with $p(N-1)$ degrees of
freedom.


10.7.3. The Distribution and Moments of the Criterion 


The distribution of the likelihood ratio criterion under the null hypothesis
can be characterized by the facts that $\lambda = \lambda_1\lambda_2$ and $\lambda_1$ and $\lambda_2$ are independent
and by the characterizations of $\lambda_1$ and $\lambda_2$. As was observed in Section
7.6, when $\Sigma$ is diagonal the correlation coefficients $\{r_{ij}\}$ are distributed
independently of the variances $\{a_{ii}/(N-1)\}$. Since $\lambda_1$ depends only on $\{r_{ij}\}$
and $\lambda_2$ depends only on $\{a_{ii}\}$, they are independently distributed when the
null hypothesis is true. Let $W = \lambda^{2/N}$, $W_1 = \lambda_1^{2/N}$, $W_2 = \lambda_2^{2/N}$. From Theorem
9.3.3, we see that $W_1$ is distributed as $\prod_{i=2}^pX_i$, where $X_2,\ldots,X_p$ are
independent and $X_i$ has the density $\beta[x \mid \tfrac12(n-i+1), \tfrac12(i-1)]$, where $n =
N - 1$. From Theorem 10.4.2 with $W_2 = p^pV_1^{2/n}$, we find that $W_2$ is distributed
as $p^p\prod_{j=2}^pY_j^{j-1}(1-Y_j)$, where $Y_2,\ldots,Y_p$ are independent and $Y_j$
has the density $\beta(y \mid \tfrac12 n(j-1), \tfrac12 n)$. Then $W$ is distributed as $W_1W_2$, where $W_1$
and $W_2$ are independent.

The moments of $W$ can be found from this characterization or from
Theorems 9.3.4 and 10.4.4. We have


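$$
\mathcal{E}W_1^h = \frac{\Gamma^p(\tfrac12 n)\,\Gamma_p(\tfrac12 n + h)}{\Gamma^p(\tfrac12 n + h)\,\Gamma_p(\tfrac12 n)},
\tag{14}
$$

$$
\mathcal{E}W_2^h = p^{ph}\,\frac{\Gamma^p(\tfrac12 n + h)\,\Gamma(\tfrac12 pn)}{\Gamma^p(\tfrac12 n)\,\Gamma(\tfrac12 pn + ph)}.
\tag{15}
$$

It follows that

$$
\mathcal{E}W^h = p^{ph}\,\frac{\Gamma(\tfrac12 pn)}{\Gamma(\tfrac12 pn + ph)}\,\frac{\Gamma_p(\tfrac12 n + h)}{\Gamma_p(\tfrac12 n)}.
\tag{16}
$$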


For $p = 2$ we have

$$
\mathcal{E}W^h = 4^h\,\frac{\Gamma(n)}{\Gamma(n+2h)}\prod_{i=1}^2\frac{\Gamma[\tfrac12(n+1-i)+h]}{\Gamma[\tfrac12(n+1-i)]}
= \frac{\Gamma(n)\,\Gamma(n-1+2h)}{\Gamma(n+2h)\,\Gamma(n-1)}
= \frac{n-1}{n-1+2h}
= (n-1)\int_0^1 z^{n-2+2h}\,dz,
\tag{17}
$$

by use of the duplication formula for the gamma function. Thus $W$ is
distributed as $Z^2$, where $Z$ has the density $(n-1)z^{n-2}$, and $W$ has the
density $\tfrac12(n-1)w^{\frac12(n-3)}$. The cdf is

$$
\Pr\{W \le w\} = F(w) = w^{\frac12(n-1)}.
\tag{18}
$$

This result can also be found from the joint distribution of $l_1, l_2$, the roots of
(8). The density for $p = 3$, 4, and 6 has been obtained by Consul (1967b). See
also Pillai and Nagarsenker (1971).
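
The reduction of the moment for $p = 2$ to $(n-1)/(n-1+2h)$ can be verified numerically. The sketch below evaluates the gamma-function form with `math.lgamma` and compares it with the closed form; the function names are hypothetical, and the factors of $\pi$ in $\Gamma_2$ are omitted because they cancel in the ratio.

```python
import math

def moment_gamma_form(n, h):
    """E W^h for p = 2 from the gamma-function expression."""
    log_val = 2 * h * math.log(2) + math.lgamma(n) - math.lgamma(n + 2 * h)
    for i in (1, 2):
        log_val += math.lgamma(0.5 * (n + 1 - i) + h) - math.lgamma(0.5 * (n + 1 - i))
    return math.exp(log_val)

def moment_closed_form(n, h):
    """E W^h for p = 2 after applying the duplication formula."""
    return (n - 1) / (n - 1 + 2 * h)

print(moment_gamma_form(10, 1.5), moment_closed_form(10, 1.5))
```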


10.7.4. Asymptotic Expansion of the Distribution 


From (16) we see that the $r$th moment of $W^{\frac12 n} = Z$, say, is

$$
\mathcal{E}Z^r = K\,p^{\frac12 pnr}\,\frac{\prod_{i=1}^p\Gamma[\tfrac12 n(1+r) + \tfrac12(1-i)]}{\Gamma[\tfrac12 pn(1+r)]}.
\tag{19}
$$

This is of the form of (1), Section 8.5, with

$$
a = p,\quad x_k = \tfrac12 n,\quad \xi_k = \tfrac12(1-k),\quad k = 1,\ldots,p;\qquad
b = 1,\quad y_1 = \tfrac12 np,\quad \eta_1 = 0.
\tag{20}
$$

Thus the expansion of Section 8.5 is valid with $f = \tfrac12 p(p+1) - 1$. To make
the second term in the expansion zero we take $\rho$ so

$$
1 - \rho = \frac{2p^2 + p + 2}{6pn}.
\tag{21}
$$

Then

$$
\omega_2 = \frac{(p+2)(p-1)(p-2)(2p^3 + 6p^2 + 3p + 2)}{288\,p^2n^2\rho^2}.
\tag{22}
$$


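Thus the cdf of $W$ is found from

$$
\begin{aligned}
\Pr\{-2\rho\log Z \le z\} &= \Pr\{-n\rho\log W \le z\} \\
&= \Pr\{\chi_f^2 \le z\} + \omega_2\big(\Pr\{\chi_{f+4}^2 \le z\} - \Pr\{\chi_f^2 \le z\}\big) + O(n^{-3}).
\end{aligned}
\tag{23}
$$

Factors $c(n, p, \varepsilon)$ have been tabulated in Table B.6 such that

$$
\Pr\{-n\rho\log W \le c(n, p, \varepsilon)\,\chi^2_{\frac12 p(p+1)-1}(\varepsilon)\} = \varepsilon.
\tag{24}
$$

Nagarsenker and Pillai (1973a) have tables for $W$.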
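
The expansion can be turned into a small calculator. The sketch below uses $1 - \rho = (2p^2+p+2)/(6pn)$ and $\omega_2 = (p+2)(p-1)(p-2)(2p^3+6p^2+3p+2)/(288p^2n^2\rho^2)$ with $f = \tfrac12 p(p+1) - 1$, drops the remainder term, and evaluates the $\chi^2$ cdf by a plain power series for the incomplete gamma function. The function names are hypothetical, and the series is an illustrative choice adequate for moderate arguments, not a recommendation from the text.

```python
import math

def chi2_cdf(z, df):
    """Pr{chi^2_df <= z} via the series for the regularized lower incomplete gamma."""
    if z <= 0:
        return 0.0
    a, x = 0.5 * df, 0.5 * z
    term = 1.0 / a
    total = term
    k = 0
    while abs(term) > 1e-16 * abs(total):
        k += 1
        term *= x / (a + k)
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def sphericity_cdf_approx(z, n, p):
    """Approximate Pr{-n rho log W <= z}, omitting the remainder term."""
    f = p * (p + 1) // 2 - 1
    rho = 1.0 - (2 * p**2 + p + 2) / (6.0 * p * n)
    w2 = ((p + 2) * (p - 1) * (p - 2) * (2 * p**3 + 6 * p**2 + 3 * p + 2)
          / (288.0 * p**2 * n**2 * rho**2))
    base = chi2_cdf(z, f)
    return base + w2 * (chi2_cdf(z, f + 4) - base)

print(sphericity_cdf_approx(8.0, n=20, p=3))
```

For $p = 2$ the correction $\omega_2$ vanishes and $n\rho = n - 1$, so the approximation reduces to $1 - e^{-z/2}$, which agrees with the exact cdf $w^{\frac12(n-1)}$ of $W$ in that case.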


10.7.5. Invariant Tests 


The null hypothesis $H$: $\Sigma = \sigma^2I$ is invariant with respect to transformations
$X^* = cQX + \nu$, where $c$ is a scalar and $Q$ is an orthogonal matrix. The
invariant of the sufficient statistic under shift of location is $A$, the invariants
of $A$ under orthogonal transformations are the characteristic roots $l_1,\ldots,l_p$,
and the invariants of the roots under scale transformations are functions
that are homogeneous of degree 0, such as the ratios of roots, say
$l_1/l_2,\ldots,l_{p-1}/l_p$. Invariant tests are based on such functions; the likelihood
ratio criterion is such a function.


Nagao (1973a) proposed the criterion

$$
\tfrac12 n\operatorname{tr}\Big(S - \frac{\operatorname{tr}S}{p}I\Big)^2\frac{p^2}{(\operatorname{tr}S)^2}
= \tfrac12 n\operatorname{tr}\Big(\frac{p}{\operatorname{tr}S}S - I\Big)^2
= \tfrac12 n\Big[\frac{p^2}{(\sum_{i=1}^pl_i)^2}\sum_{i=1}^pl_i^2 - p\Big]
= \tfrac12 n\,\frac{\sum_{i=1}^p(l_i - \bar l)^2}{\bar l^2},
\tag{25}
$$

where $\bar l = \sum_{i=1}^pl_i/p$. The left-hand side of (25) is based on the loss function
$L_q(\Sigma, G)$ of Section 7.8; the right-hand side shows it is proportional to the
square of the coefficient of variation of the characteristic roots of the sample
covariance matrix $S$. Another criterion is $l_1/l_p$. Percentage points have been
given by Krishnaiah and Schuurmann (1974).
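
The two sides of Nagao's criterion can be compared numerically. The sketch below computes $\tfrac12 n\operatorname{tr}[(p/\operatorname{tr}S)S - I]^2$ and $\tfrac12 n\sum_i(l_i-\bar l)^2/\bar l^2$ from the same matrix; the function name and the example matrix are illustrative assumptions.

```python
import numpy as np

def nagao_sphericity(S, n):
    """Return the trace form and the root form of Nagao's sphericity criterion."""
    p = S.shape[0]
    T = S * (p / np.trace(S)) - np.eye(p)
    trace_form = 0.5 * n * np.trace(T @ T)
    l = np.linalg.eigvalsh(S)                 # characteristic roots of S
    root_form = 0.5 * n * np.sum((l - l.mean()) ** 2) / l.mean() ** 2
    return trace_form, root_form

S = np.array([[2.0, 0.3], [0.3, 1.0]])
print(nagao_sphericity(S, n=15))
```

When $S$ is a multiple of the identity all roots coincide and both forms are zero.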


10.7.6. Confidence Regions 


Given observations $y_1,\ldots,y_N$ from $N(\nu, \Psi)$, we can test $\Psi = \sigma^2\Psi_0$ for any
specified $\Psi_0$. From this family of tests we can set up a confidence region for
$\Psi$. If any matrix is in the confidence region, all multiples of it are. This kind
of confidence region is of interest if all components of $y_\alpha$ are measured in
the same unit, but the investigator wants a region independent of this
common unit. The confidence region of confidence $1 - \varepsilon$ consists of all
matrices $\Psi^*$ satisfying

$$
\frac{|B\Psi^{*-1}|}{[(\operatorname{tr}B\Psi^{*-1})/p]^p} \ge \lambda^{2/N}(\varepsilon),
\tag{26}
$$

where $\lambda(\varepsilon)$ is the $\varepsilon$ significance level for the criterion.
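Consider the case of $p = 2$. If the common unit of measurement is
irrelevant, the investigator is interested in $\tau = \psi_{11}/\psi_{22}$ and $\rho = \psi_{12}/\sqrt{\psi_{11}\psi_{22}}$.
In this case

$$
\Psi^{-1} = \frac{1}{\psi_{11}\psi_{22}(1-\rho^2)}
\begin{pmatrix}\psi_{22} & -\rho\sqrt{\psi_{11}\psi_{22}} \\ -\rho\sqrt{\psi_{11}\psi_{22}} & \psi_{11}\end{pmatrix}
= \frac{1}{\psi_{11}(1-\rho^2)}
\begin{pmatrix}1 & -\rho\sqrt{\tau} \\ -\rho\sqrt{\tau} & \tau\end{pmatrix}.
\tag{27}
$$

The region in terms of $\tau$ and $\rho$ is

$$
\frac{4(b_{11}b_{22} - b_{12}^2)\,\tau(1-\rho^2)}{(b_{11} + \tau b_{22} - 2\rho\sqrt{\tau}\,b_{12})^2} \ge \lambda^{2/N}(\varepsilon).
\tag{28}
$$

Hickman (1953) has given an example of such a confidence region.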


10.8. TESTING THE HYPOTHESIS THAT A COVARIANCE 
MATRIX IS EQUAL TO A GIVEN MATRIX 


10.8.1. The Criteria 


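If $Y$ is distributed according to $N(\nu, \Psi)$, we wish to test $H_1$ that $\Psi = \Psi_0$,
where $\Psi_0$ is a given positive definite matrix. By the argument of the
preceding section we see that this is equivalent to testing the hypothesis
$H_1$: $\Sigma = I$, where $\Sigma$ is the covariance matrix of a vector $X$ distributed
according to $N(\mu, \Sigma)$. Given a sample $x_1,\ldots,x_N$, the likelihood ratio criterion
is

$$
\lambda_1 = \frac{\max_{\mu}L(\mu, I)}{\max_{\mu,\Sigma}L(\mu, \Sigma)},
\tag{1}
$$

where the likelihood function is

$$
L(\mu, \Sigma) = (2\pi)^{-\frac12 pN}|\Sigma|^{-\frac12 N}\exp\Big[-\tfrac12\sum_{\alpha=1}^N(x_\alpha - \mu)'\Sigma^{-1}(x_\alpha - \mu)\Big].
\tag{2}
$$

Results in Chapter 3 show that

$$
\lambda_1 = \frac{(2\pi)^{-\frac12 pN}\exp\big[-\tfrac12\sum_{\alpha=1}^N(x_\alpha - \bar x)'(x_\alpha - \bar x)\big]}{(2\pi)^{-\frac12 pN}|(1/N)A|^{-\frac12 N}e^{-\frac12 pN}}
= \Big(\frac{e}{N}\Big)^{\frac12 pN}|A|^{\frac12 N}e^{-\frac12\operatorname{tr}A},
\tag{3}
$$

where

$$
A = \sum_{\alpha=1}^N(x_\alpha - \bar x)(x_\alpha - \bar x)'.
\tag{4}
$$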

Sugiura and Nagao (1968) have shown that the likelihood ratio test is biased,
but the modified likelihood ratio test based on

$$
\lambda_1^* = \Big(\frac{e}{n}\Big)^{\frac12 pn}|A|^{\frac12 n}e^{-\frac12\operatorname{tr}A} = e^{\frac12 pn}\big(|S|e^{-\operatorname{tr}S}\big)^{\frac12 n},
\tag{5}
$$

where $S = (1/n)A$, is unbiased. Note that

$$
-\frac{2}{n}\log\lambda_1^* = \operatorname{tr}S - \log|S| - p = L_l(I, S),
\tag{6}
$$

where $L_l(I, S)$ is the loss function for estimating $I$ by $S$ defined in (2) of
Section 7.8. In terms of the characteristic roots of $S$ the criterion (6) is a
constant plus

$$
\sum_{i=1}^pl_i - \log\prod_{i=1}^pl_i - p = \sum_{i=1}^p(l_i - \log l_i - 1);
\tag{7}
$$

for each $i$ the minimum of (7) is at $l_i = 1$.

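Using the algebra of the preceding section, we see that given $y_1,\ldots,y_N$ as
observation vectors of $p$ components from $N(\nu, \Psi)$, the modified likelihood
ratio criterion for testing the hypothesis $H_1$: $\Psi = \Psi_0$, where $\Psi_0$ is specified,
is

$$
\lambda_1^* = \Big(\frac{e}{n}\Big)^{\frac12 pn}|B\Psi_0^{-1}|^{\frac12 n}e^{-\frac12\operatorname{tr}B\Psi_0^{-1}},
\tag{8}
$$

where

$$
B = \sum_{\alpha=1}^N(y_\alpha - \bar y)(y_\alpha - \bar y)'.
\tag{9}
$$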
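
A minimal sketch of the modified criterion in use: it computes $-2\log\lambda_1^*$ from an $N \times p$ data array and a specified positive definite matrix, working on the log scale. The function name and the test matrix are hypothetical; the only facts relied on are that $\Psi_0^{-1}B$ has the same trace and determinant as $B\Psi_0^{-1}$.

```python
import numpy as np

def modified_lr_stat(y, psi0):
    """-2 log(lambda_1^*) for testing Psi = Psi_0 from data y of shape (N, p)."""
    N, p = y.shape
    n = N - 1
    yc = y - y.mean(axis=0)
    B = yc.T @ yc
    M = np.linalg.solve(psi0, B)           # Psi0^{-1} B
    sign, logdet = np.linalg.slogdet(M)
    log_lam = 0.5 * p * n * (1.0 - np.log(n)) + 0.5 * n * logdet - 0.5 * np.trace(M)
    return -2.0 * log_lam

rng = np.random.default_rng(2)
psi0 = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.5]])
print(modified_lr_stat(rng.standard_normal((25, 3)), psi0))
```

The result equals $n\sum_i(l_i - \log l_i - 1)$, where the $l_i$ are the characteristic roots of $\Psi_0^{-1}S$, so it is nonnegative and vanishes only when every root equals 1.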


10.8.2. The Distribution and Moments of the Modified Likelihood
Ratio Criterion


The null hypothesis $H_1$: $\Sigma = I$ is the intersection of the null hypothesis of
Section 10.7, $H$: $\Sigma = \sigma^2I$, and the null hypothesis $\sigma^2 = 1$ given $\Sigma = \sigma^2I$.
The likelihood ratio criterion for $H_1$ given by (3) is the product of (7) of
Section 10.7 and

$$
\Big(\frac{\operatorname{tr}A}{pN}\Big)^{\frac12 pN}e^{-\frac12\operatorname{tr}A + \frac12 pN},
\tag{10}
$$

which is the likelihood ratio criterion for testing the hypothesis $\sigma^2 = 1$ given
$\Sigma = \sigma^2I$. The modified criterion $\lambda_1^*$ is the product of $|A|^{\frac12 n}/(\operatorname{tr}A/p)^{\frac12 pn}$ and

$$
\Big(\frac{\operatorname{tr}A}{pn}\Big)^{\frac12 pn}e^{-\frac12\operatorname{tr}A + \frac12 pn};
\tag{11}
$$

these two factors are independent (Lemma 10.4.1). The characterization of
the distribution of the modified criterion can be obtained from Section
10.7.3. The quantity $\operatorname{tr}A$ has the $\chi^2$-distribution with $np$ degrees of freedom
under the null hypothesis.

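Instead of obtaining the moments and characteristic function of $\lambda_1^*$ [defined
by (5)] from the preceding characterization, we shall find them by use
of the fact that $A$ has the distribution $W(\Sigma, n)$. We shall calculate

$$
\begin{aligned}
\mathcal{E}\lambda_1^{*h} &= \int\!\cdots\!\int\Big(\frac{e}{n}\Big)^{\frac12 pnh}|A|^{\frac12 nh}e^{-\frac12 h\operatorname{tr}A}\,w(A \mid \Sigma, n)\,dA \\
&= \frac{e^{\frac12 pnh}}{n^{\frac12 pnh}}\int\!\cdots\!\int|A|^{\frac12 nh}e^{-\frac12 h\operatorname{tr}A}\,w(A \mid \Sigma, n)\,dA.
\end{aligned}
\tag{12}
$$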
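Since

$$
\begin{aligned}
|A|^{\frac12 nh}e^{-\frac12 h\operatorname{tr}A}\,w(A \mid \Sigma, n)
&= \frac{|A|^{\frac12(n+nh-p-1)}e^{-\frac12(\operatorname{tr}\Sigma^{-1}A + \operatorname{tr}hA)}}{2^{\frac12 pn}|\Sigma|^{\frac12 n}\Gamma_p(\tfrac12 n)} \\
&= \frac{2^{\frac12 pnh}\,\Gamma_p[\tfrac12 n(1+h)]}{|\Sigma|^{\frac12 n}|\Sigma^{-1} + hI|^{\frac12(n+nh)}\Gamma_p(\tfrac12 n)}
\cdot\frac{|\Sigma^{-1} + hI|^{\frac12(n+nh)}|A|^{\frac12(n+nh-p-1)}e^{-\frac12\operatorname{tr}(\Sigma^{-1}+hI)A}}{2^{\frac12 p(n+nh)}\Gamma_p[\tfrac12 n(1+h)]} \\
&= \frac{2^{\frac12 pnh}|\Sigma|^{\frac12 nh}\,\Gamma_p[\tfrac12 n(1+h)]}{|I + h\Sigma|^{\frac12(n+nh)}\Gamma_p(\tfrac12 n)}\,
w\big(A \mid (\Sigma^{-1} + hI)^{-1}, n+nh\big),
\end{aligned}
\tag{13}
$$

the $h$th moment of $\lambda_1^*$ is

$$
\mathcal{E}\lambda_1^{*h} = \Big(\frac{2e}{n}\Big)^{\frac12 pnh}\frac{|\Sigma|^{\frac12 nh}}{|I + h\Sigma|^{\frac12(n+nh)}}\prod_{j=1}^p\frac{\Gamma[\tfrac12(n+nh+1-j)]}{\Gamma[\tfrac12(n+1-j)]}.
\tag{14}
$$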
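Then the characteristic function of $-2\log\lambda_1^*$ is

$$
\mathcal{E}e^{-2it\log\lambda_1^*} = \mathcal{E}\lambda_1^{*\,-2it}
= \Big(\frac{2e}{n}\Big)^{-ipnt}\frac{|\Sigma|^{-int}}{|I - 2it\Sigma|^{\frac12 n - int}}\prod_{j=1}^p\frac{\Gamma[\tfrac12(n+1-j) - int]}{\Gamma[\tfrac12(n+1-j)]}.
\tag{15}
$$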


When the null hypothesis is true, $\Sigma = I$, and

$$
\mathcal{E}e^{-2it\log\lambda_1^*} = \Big(\frac{2e}{n}\Big)^{-ipnt}(1-2it)^{-\frac12 p(n-2int)}\prod_{j=1}^p\frac{\Gamma[\tfrac12(n+1-j) - int]}{\Gamma[\tfrac12(n+1-j)]}.
\tag{16}
$$

This characteristic function is the product of $p$ terms such as

$$
\phi_j(t) = \Big(\frac{2e}{n}\Big)^{-int}(1-2it)^{-\frac12(n-2int)}\frac{\Gamma[\tfrac12(n+1-j) - int]}{\Gamma[\tfrac12(n+1-j)]}.
\tag{17}
$$


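Thus $-2\log\lambda_1^*$ is distributed as the sum of $p$ independent variates, the
characteristic function of the $j$th being (17). Using Stirling's approximation
for the gamma function, we have

$$
\begin{aligned}
\phi_j(t) &\sim \Big(\frac{2e}{n}\Big)^{-int}(1-2it)^{-\frac12(n-2int)}
\frac{e^{-[\frac12(n+1-j)-int]}\,[\tfrac12(n+1-j) - int]^{\frac12(n-j)-int}}{e^{-\frac12(n+1-j)}\,[\tfrac12(n+1-j)]^{\frac12(n-j)}} \\
&= (1-2it)^{-\frac12 j}\Big(1 - \frac{j-1}{n(1-2it)}\Big)^{\frac12(n-j)-int}\Big(1 - \frac{j-1}{n}\Big)^{-\frac12(n-j)}.
\end{aligned}
\tag{18}
$$

As $n \to \infty$, $\phi_j(t) \to (1-2it)^{-\frac12 j}$, which is the characteristic function of $\chi_j^2$
($\chi^2$ with $j$ degrees of freedom). Thus $-2\log\lambda_1^*$ is asymptotically distributed
as $\sum_{j=1}^p\chi_j^2$, which is $\chi^2$ with $\sum_{j=1}^pj = \tfrac12 p(p+1)$ degrees of freedom. The
distribution of $\lambda_1^*$ can be further expanded [Korin (1968), Davis (1971)] as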


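$$
\Pr\{-2\rho\log\lambda_1^* \le z\}
= \Pr\{\chi_f^2 \le z\} + \frac{\gamma_2}{\rho^2(N-1)^2}\big(\Pr\{\chi_{f+4}^2 \le z\} - \Pr\{\chi_f^2 \le z\}\big) + O(N^{-3}),
\tag{19}
$$

where

$$
\rho = 1 - \frac{2p^2 + 3p - 1}{6n(p+1)},
\tag{20}
$$

$$
\gamma_2 = \frac{p(2p^4 + 6p^3 + p^2 - 12p - 13)}{288(p+1)}.
\tag{21}
$$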


Nagarsenker and Pillai (1973b) found exact distributions and tabulated 5%
and 1% significance points, as did Davis and Field (1971), for $p = 2(1)10$ and
$n = 6(1)30(5)50, 60, 120$. Table B.7 [due to Korin (1968)] gives some 5% and
1% significance points of $-2\log\lambda_1^*$ for small values of $n$ and $p = 2(1)10$.


10.8.3. Invariant Tests 


The null hypothesis $H_1$: $\Sigma = I$ is invariant with respect to transformations
$X^* = QX + \nu$, where $Q$ is an orthogonal matrix. The invariants of the sufficient
statistics are the characteristic roots $l_1,\ldots,l_p$ of $S$, and the invariants of
the parameters are the characteristic roots of $\Sigma$. Invariant tests are based on
the roots of $S$; the modified likelihood ratio criterion is one of them. Nagao
(1973a) suggested the criterion

$$
\tfrac12 n\operatorname{tr}(S - I)^2 = \tfrac12 n\sum_{i=1}^p(l_i - 1)^2.
\tag{22}
$$

Under the null hypothesis this criterion has a limiting $\chi^2$-distribution with
$\tfrac12 p(p+1)$ degrees of freedom.

Roy (1957), Section 6.4, proposed a test based on the largest and smallest
characteristic roots $l_1$ and $l_p$: Reject the null hypothesis if

$$
l_p < l \quad\text{or}\quad l_1 > u,
\tag{23}
$$

where

$$
\Pr\{l < l_p,\ l_1 < u \mid \Sigma = I\} = 1 - \varepsilon
\tag{24}
$$

and $\varepsilon$ is the significance level. Clemm, Krishnaiah, and Waikar (1973) give
tables of $u = 1/l$. See also Schuurman and Waikar (1973).


10.8.4. Confidence Bounds for Quadratic Forms 


The test procedure based on the smallest and largest characteristic roots can
be inverted to give confidence bounds on quadratic forms in $\Sigma$. Suppose $nS$
has the distribution $W(\Sigma, n)$. Let $C$ be a nonsingular matrix such that
$\Sigma = C'C$. Then $nS^* = nC'^{-1}SC^{-1}$ has the distribution $W(I, n)$. Since $l_p^* \le
a'S^*a/a'a \le l_1^*$ for all $a$, where $l_p^*$ and $l_1^*$ are the smallest and largest
characteristic roots of $S^*$ (Sections 11.2 and A.2),

$$
\Pr\Big\{l \le \frac{a'S^*a}{a'a} \le u\ \ \forall a \ne 0\Big\} = 1 - \varepsilon,
\tag{25}
$$

where

$$
\Pr\{l \le l_p^* \le l_1^* \le u\} = 1 - \varepsilon.
\tag{26}
$$


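Let $a = Cb$. Then $a'a = b'C'Cb = b'\Sigma b$ and $a'S^*a = b'C'S^*Cb = b'Sb$. Thus
(25) is

$$
1 - \varepsilon = \Pr\Big\{l \le \frac{b'Sb}{b'\Sigma b} \le u\ \ \forall b \ne 0\Big\}
= \Pr\Big\{\frac{b'Sb}{u} \le b'\Sigma b \le \frac{b'Sb}{l}\ \ \forall b\Big\}.
\tag{27}
$$

Given an observed $S$, one can assert

$$
\frac{b'Sb}{u} \le b'\Sigma b \le \frac{b'Sb}{l}\quad\forall b
\tag{28}
$$

with confidence $1 - \varepsilon$.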
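If $b$ has 1 in the $i$th position and 0's elsewhere, (28) is $s_{ii}/u \le \sigma_{ii} \le s_{ii}/l$. If
$b$ has 1 in the $i$th position, $-1$ in the $j$th position, $i \ne j$, and 0's elsewhere,
then (28) is

$$
\frac{s_{ii} + s_{jj} - 2s_{ij}}{u} \le \sigma_{ii} + \sigma_{jj} - 2\sigma_{ij} \le \frac{s_{ii} + s_{jj} - 2s_{ij}}{l}.
\tag{29}
$$

Manipulation of these inequalities yields

$$
\frac{s_{ij}}{l} - \frac{s_{ii} + s_{jj}}{2}\Big(\frac1l - \frac1u\Big) \le \sigma_{ij} \le \frac{s_{ij}}{u} + \frac{s_{ii} + s_{jj}}{2}\Big(\frac1l - \frac1u\Big),\qquad i \ne j.
\tag{30}
$$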


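We can obtain simultaneously confidence intervals on all elements of $\Sigma$.
From (27) we can obtain

$$
\begin{aligned}
1 - \varepsilon &= \Pr\Big\{\frac1u\frac{b'Sb}{b'b} \le \frac{b'\Sigma b}{b'b} \le \frac1l\frac{b'Sb}{b'b}\ \ \forall b\Big\} \\
&\le \Pr\Big\{\frac1u\min_a\frac{a'Sa}{a'a} \le \frac{b'\Sigma b}{b'b} \le \frac1l\max_a\frac{a'Sa}{a'a}\ \ \forall b\Big\} \\
&= \Pr\Big\{\frac1ul_p \le \lambda_p \le \lambda_1 \le \frac1ll_1\Big\},
\end{aligned}
\tag{31}
$$

where $l_1$ and $l_p$ are the largest and smallest characteristic roots of $S$ and $\lambda_1$
and $\lambda_p$ are the largest and smallest characteristic roots of $\Sigma$. Then

$$
\frac1ul_p \le \lambda(\Sigma) \le \frac1ll_1
\tag{32}
$$

is a confidence interval for all characteristic roots of $\Sigma$ with confidence at
least $1 - \varepsilon$. In Section 11.6 we give tighter bounds on $\lambda(\Sigma)$ with exact
confidence.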
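
The elementwise bounds are simple enough to compute directly. The sketch below evaluates $s_{ii}/u \le \sigma_{ii} \le s_{ii}/l$ and the bounds on $\sigma_{ij}$ obtained by manipulating the inequalities for $b$ with entries $1$ and $-1$. The function names are hypothetical, and the values of $l$ and $u$ in the example are placeholders chosen for arithmetic convenience, not tabulated percentage points.

```python
def variance_bounds(s_ii, l, u):
    """Bounds s_ii/u <= sigma_ii <= s_ii/l."""
    return s_ii / u, s_ii / l

def covariance_bounds(s_ii, s_jj, s_ij, l, u):
    """Bounds on sigma_ij, i != j, implied by the bounds for b = e_i - e_j."""
    half_width = 0.5 * (s_ii + s_jj) * (1.0 / l - 1.0 / u)
    return s_ij / l - half_width, s_ij / u + half_width

# Placeholder limits l = 0.5 and u = 2.0 (assumed, not from any table).
print(variance_bounds(2.0, 0.5, 2.0))
print(covariance_bounds(2.0, 1.0, 0.5, 0.5, 2.0))
```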




10.9. TESTING THE HYPOTHESIS THAT A MEAN VECTOR AND 
A COVARIANCE MATRIX ARE EQUAL TO A GIVEN VECTOR 
AND MATRIX 


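In Chapter 3 we pointed out that if $\Psi$ is known, $(\bar y - \nu_0)'\Psi_0^{-1}(\bar y - \nu_0)$ is
suitable for testing

$$
H_2\colon\ \nu = \nu_0\ \text{given}\ \Psi = \Psi_0.
\tag{1}
$$

Now let us combine $H_1$ of Section 10.8 and $H_2$, and test

$$
H\colon\ \nu = \nu_0,\quad \Psi = \Psi_0,
\tag{2}
$$

on the basis of a sample $y_1,\ldots,y_N$ from $N(\nu, \Psi)$.
Let

$$
X = C(Y - \nu_0),
\tag{3}
$$

where

$$
C\Psi_0C' = I.
\tag{4}
$$

Then $x_1,\ldots,x_N$ constitutes a sample from $N(\mu, \Sigma)$, and the hypothesis is

$$
H\colon\ \mu = 0,\quad \Sigma = I.
\tag{5}
$$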


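The likelihood ratio criterion for $H_2$: $\mu = 0$, given $\Sigma = I$, is

$$
\lambda_2 = e^{-\frac12 N\bar x'\bar x}.
\tag{6}
$$

The likelihood ratio criterion for $H$ is (by Lemma 10.3.1)

$$
\begin{aligned}
\lambda = \lambda_1\lambda_2 &= \Big(\frac{e}{N}\Big)^{\frac12 pN}|A|^{\frac12 N}e^{-\frac12\operatorname{tr}A}\,e^{-\frac12 N\bar x'\bar x} \\
&= \Big(\frac{e}{N}\Big)^{\frac12 pN}|A|^{\frac12 N}e^{-\frac12(\operatorname{tr}A + N\bar x'\bar x)} \\
&= \Big(\frac{e}{N}\Big)^{\frac12 pN}|A|^{\frac12 N}e^{-\frac12\sum_{\alpha}x_\alpha'x_\alpha}.
\end{aligned}
\tag{7}
$$

The likelihood ratio test (rejecting $H$ if $\lambda$ is less than a suitable constant) is
unbiased [Srivastava and Khatri (1979), Theorem 10.4.5]. The two factors $\lambda_1$
and $\lambda_2$ are independent because $\lambda_1$ is a function of $A$ and $\lambda_2$ is a function of
$\bar x$, and $A$ and $\bar x$ are independent. Since

$$
\mathcal{E}\lambda_2^h = \mathcal{E}e^{-\frac12 hN\bar x'\bar x} = (1+h)^{-\frac12 p},
\tag{8}
$$

the $h$th moment of $\lambda$ is

$$
\mathcal{E}\lambda^h = \Big(\frac{2e}{N}\Big)^{\frac12 pNh}\frac{1}{(1+h)^{\frac12 pN(1+h)}}\prod_{j=1}^p\frac{\Gamma[\tfrac12(N+Nh-j)]}{\Gamma[\tfrac12(N-j)]}
\tag{9}
$$

under the null hypothesis. Then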

$$
-2\log\lambda = -2\log\lambda_1 - 2\log\lambda_2
\tag{10}
$$

has asymptotically the $\chi^2$-distribution with $f = p(p+1)/2 + p$ degrees of
freedom. In fact, an asymptotic expansion of the distribution [Davis (1971)] of
$-2\rho\log\lambda$ is

$$
\Pr\{-2\rho\log\lambda \le z\}
= \Pr\{\chi_f^2 \le z\} + \frac{\gamma_2}{\rho^2N^2}\big(\Pr\{\chi_{f+4}^2 \le z\} - \Pr\{\chi_f^2 \le z\}\big) + O(N^{-3}),
\tag{11}
$$

where

$$
\rho = 1 - \frac{2p^2 + 9p + 11}{6N(p+3)},
\tag{12}
$$

$$
\gamma_2 = \frac{p(2p^4 + 18p^3 + 49p^2 + 36p - 13)}{288(p+3)}.
\tag{13}
$$

Nagarsenker and Pillai (1974) used the moments to derive exact distributions
and tabulated the 5% and 1% significance points for $p = 2(1)6$ and $N =
4(1)20(2)40(5)100$.


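Now let us return to the observations $y_1,\ldots,y_N$. Then

$$
\begin{aligned}
\sum_{\alpha=1}^Nx_\alpha'x_\alpha &= \sum_{\alpha=1}^N(y_\alpha - \nu_0)'C'C(y_\alpha - \nu_0) \\
&= \sum_{\alpha=1}^N(y_\alpha - \nu_0)'\Psi_0^{-1}(y_\alpha - \nu_0) \\
&= \operatorname{tr}A + N\bar x'\bar x \\
&= \operatorname{tr}(B\Psi_0^{-1}) + N(\bar y - \nu_0)'\Psi_0^{-1}(\bar y - \nu_0)
\end{aligned}
\tag{14}
$$

and

$$
|A| = |B\Psi_0^{-1}|.
\tag{15}
$$

Theorem 10.9.1. Given the $p$-component observation vectors $y_1,\ldots,y_N$ from
$N(\nu, \Psi)$, the likelihood ratio criterion for testing the hypothesis $H$: $\nu = \nu_0$,
$\Psi = \Psi_0$, is

$$
\lambda = \Big(\frac{e}{N}\Big)^{\frac12 pN}|B\Psi_0^{-1}|^{\frac12 N}\exp\big\{-\tfrac12\big[\operatorname{tr}B\Psi_0^{-1} + N(\bar y - \nu_0)'\Psi_0^{-1}(\bar y - \nu_0)\big]\big\}.
\tag{16}
$$

When the null hypothesis is true, $-2\log\lambda$ is asymptotically distributed as $\chi^2$
with $\tfrac12 p(p+1) + p$ degrees of freedom.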
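
The criterion of the theorem can be evaluated directly from data. The sketch below returns $-2\log\lambda$ on the log scale for a specified mean vector and covariance matrix; the function name, argument names, and simulated data are illustrative assumptions.

```python
import numpy as np

def joint_lr_stat(y, nu0, psi0):
    """-2 log(lambda) for testing nu = nu0 and Psi = Psi0 from data y of shape (N, p)."""
    N, p = y.shape
    ybar = y.mean(axis=0)
    yc = y - ybar
    B = yc.T @ yc
    M = np.linalg.solve(psi0, B)                   # Psi0^{-1} B
    d = ybar - nu0
    quad = N * d @ np.linalg.solve(psi0, d)        # N (ybar - nu0)' Psi0^{-1} (ybar - nu0)
    sign, logdet = np.linalg.slogdet(M)
    log_lam = (0.5 * p * N * (1.0 - np.log(N)) + 0.5 * N * logdet
               - 0.5 * (np.trace(M) + quad))
    return -2.0 * log_lam

rng = np.random.default_rng(3)
y = rng.standard_normal((40, 2))
print(joint_lr_stat(y, np.zeros(2), np.eye(2)))
```

In canonical form the statistic splits into a covariance part, $N[\operatorname{tr}(A/N) - \log|A/N| - p]$, and a mean part, $N\bar x'\bar x$, matching the factorization $\lambda = \lambda_1\lambda_2$.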


10.10. ADMISSIBILITY OF TESTS 


We shall consider some Bayes solutions to the problem of testing the
hypothesis

$$
\Sigma_1 = \cdots = \Sigma_q
\tag{1}
$$

as in Section 10.2. Under the alternative hypothesis, let

$$
[\mu^{(g)}, \Sigma_g] = \big[(I + C_gC_g')^{-1}C_gy^{(g)},\ (I + C_gC_g')^{-1}\big],\qquad g = 1,\ldots,q,
\tag{2}
$$

where the $p \times r_g$ matrix $C_g$ has density proportional to $|I + C_gC_g'|^{-\frac12 n_g}$,
$n_g = N_g - 1$, the $r_g$-component random vector $y^{(g)}$ has the conditional normal
distribution with mean 0 and covariance matrix $(1/N_g)[I_{r_g} - C_g'(I_p +
C_gC_g')^{-1}C_g]^{-1}$ given $C_g$, and $(C_1, y^{(1)}),\ldots,(C_q, y^{(q)})$ are independently
distributed. As we shall see, we need to choose suitable integers $r_1,\ldots,r_q$. Note
that the integral of $|I + C_gC_g'|^{-\frac12 n_g}$ is finite if $n_g \ge p + r_g$. Then the numerator
of the Bayes ratio is


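$$
\begin{aligned}
&\text{const}\prod_{g=1}^q\int\!\cdots\!\int|I + C_gC_g'|^{\frac12 N_g}
\exp\Big\{-\tfrac12\sum_{\alpha=1}^{N_g}\big[x_\alpha^{(g)} - (I + C_gC_g')^{-1}C_gy^{(g)}\big]'(I + C_gC_g')\big[x_\alpha^{(g)} - (I + C_gC_g')^{-1}C_gy^{(g)}\big]\Big\} \\
&\qquad\cdot|I + C_gC_g'|^{-\frac12 n_g}\,\big|I - C_g'(I + C_gC_g')^{-1}C_g\big|^{\frac12}
\exp\big\{-\tfrac12 N_gy^{(g)\prime}\big[I - C_g'(I + C_gC_g')^{-1}C_g\big]y^{(g)}\big\}\,dy^{(g)}\,dC_g \\
&= \text{const}\prod_{g=1}^q\int\!\cdots\!\int\exp\Big\{-\tfrac12\Big[\sum_{\alpha=1}^{N_g}x_\alpha^{(g)\prime}(I + C_gC_g')x_\alpha^{(g)}
- 2y^{(g)\prime}C_g'\sum_{\alpha=1}^{N_g}x_\alpha^{(g)} + N_gy^{(g)\prime}y^{(g)}\Big]\Big\}\,dy^{(g)}\,dC_g \\
&= \text{const}\prod_{g=1}^q\exp\Big\{-\tfrac12\sum_{\alpha=1}^{N_g}x_\alpha^{(g)\prime}x_\alpha^{(g)}\Big\}
\int\!\cdots\!\int\exp\big\{-\tfrac12 N_g(y^{(g)} - C_g'\bar x^{(g)})'(y^{(g)} - C_g'\bar x^{(g)}) - \tfrac12\operatorname{tr}C_g'A_gC_g\big\}\,dy^{(g)}\,dC_g \\
&= \text{const}\prod_{g=1}^q\exp\big\{-\tfrac12\big[\operatorname{tr}A_g + N_g\bar x^{(g)\prime}\bar x^{(g)}\big]\big\}\,|A_g|^{-\frac12 r_g}.
\end{aligned}
\tag{3}
$$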


Under the null hypothesis let

$$
[\mu^{(g)}, \Sigma_g] = \big[(I + CC')^{-1}Cy^{(g)},\ (I + CC')^{-1}\big],
\tag{4}
$$

where the $p \times r$ matrix $C$ has density proportional to $|I + CC'|^{-\frac12 n}$, $n =
\sum_{g=1}^qn_g$, the $r$-component vector $y^{(g)}$ has the conditional normal distribution
with mean 0 and covariance matrix $(1/N_g)[I_r - C'(I_p + CC')^{-1}C]^{-1}$ given $C$,
and $y^{(1)},\ldots,y^{(q)}$ are conditionally independent. Note that the integral of
$|I + CC'|^{-\frac12 n}$ is finite if $n \ge p + r$. The denominator of the Bayes ratio is


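$$
\begin{aligned}
&\text{const}\int\!\cdots\!\int\prod_{g=1}^q|I + CC'|^{\frac12 N_g}
\exp\Big\{-\tfrac12\sum_{\alpha=1}^{N_g}\big[x_\alpha^{(g)} - (I + CC')^{-1}Cy^{(g)}\big]'(I + CC')\big[x_\alpha^{(g)} - (I + CC')^{-1}Cy^{(g)}\big]\Big\} \\
&\qquad\cdot|I + CC'|^{-\frac12 n}\,\big|I - C'(I + CC')^{-1}C\big|^{\frac12 q}
\exp\Big\{-\tfrac12\sum_{g=1}^qN_gy^{(g)\prime}\big[I - C'(I + CC')^{-1}C\big]y^{(g)}\Big\}\,dy^{(1)}\cdots dy^{(q)}\,dC \\
&= \text{const}\int\!\cdots\!\int\prod_{g=1}^q\exp\Big\{-\tfrac12\Big[\sum_{\alpha=1}^{N_g}x_\alpha^{(g)\prime}(I + CC')x_\alpha^{(g)}
- 2y^{(g)\prime}C'\sum_{\alpha=1}^{N_g}x_\alpha^{(g)} + N_gy^{(g)\prime}y^{(g)}\Big]\Big\}\,dy^{(1)}\cdots dy^{(q)}\,dC \\
&= \text{const}\,\exp\Big\{-\tfrac12\sum_{g=1}^q\sum_{\alpha=1}^{N_g}x_\alpha^{(g)\prime}x_\alpha^{(g)}\Big\}
\int\!\cdots\!\int\exp\Big\{-\tfrac12\sum_{g=1}^qN_g(y^{(g)} - C'\bar x^{(g)})'(y^{(g)} - C'\bar x^{(g)}) - \tfrac12\operatorname{tr}C'AC\Big\}\prod_{g=1}^qdy^{(g)}\,dC \\
&= \text{const}\,\exp\Big\{-\tfrac12\Big[\operatorname{tr}A + \sum_{g=1}^qN_g\bar x^{(g)\prime}\bar x^{(g)}\Big]\Big\}\,|A|^{-\frac12 r}.
\end{aligned}
\tag{5}
$$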


The Bayes test procedure is to reject the hypothesis if

$$
\frac{|A|^{\frac12 r}}{\prod_{g=1}^q|A_g|^{\frac12 r_g}} \ge c.
\tag{6}
$$

For invariance we want $\sum_{g=1}^qr_g = r$.
The binding constraint on the choice of $r_1,\ldots,r_q$ is $r_g \le n_g - p$, $g =
1,\ldots,q$. It is possible in some special cases to choose $r_1,\ldots,r_q$ so that
$(r_1,\ldots,r_q)$ is proportional to $(N_1,\ldots,N_q)$ and hence yield the likelihood ratio
test or proportional to $(n_1,\ldots,n_q)$ and hence yield the modified likelihood
ratio test, but since $r_1,\ldots,r_q$ have to be integers, it may not be possible to
choose them in either such way. Next we consider an extension of this
approach that involves the choice of numbers $t_1,\ldots,t_q$, and $t$ as well as
$r_1,\ldots,r_q$ and $r$.

Suppose $2(p-1) < n_g$, $g = 1,\ldots,q$, and take $r_g \ge p$. Let $t_g$ be a real
number such that $2p - 1 < r_g + t_g + p < n_g + 1$, and let $t$ be a real number
such that $2p - 1 < r + t + p < n + 1$. Under the alternative hypothesis let the
marginal density of $C_g$ be proportional to $|C_gC_g'|^{\frac12 t_g}|I + C_gC_g'|^{-\frac12 n_g}$, $g =
1,\ldots,q$, and under the null hypothesis let the marginal density of $C$ be
proportional to $|CC'|^{\frac12 t}|I + CC'|^{-\frac12 n}$. (The conditions on $t_1,\ldots,t_q$, and $t$
ensure that the purported densities have finite integrals; see Problem 10.18.)
Then the Bayes procedure is to reject the null hypothesis if

$$
\frac{|A|^{\frac12(r+t)}}{\prod_{g=1}^q|A_g|^{\frac12(r_g+t_g)}} \ge c.
\tag{7}
$$

For invariance we want $t = \sum_{g=1}^qt_g$. If $t_1,\ldots,t_q$ are taken so $r_g + t_g = kN_g$ and
$p - 1 < kN_g < N_g - p$, $g = 1,\ldots,q$, for some $k$, then (7) is the likelihood ratio
test; if $r_g + t_g = kn_g$ and $p - 1 < kn_g < n_g + 1 - p$, $g = 1,\ldots,q$, for some $k$,
then (7) is the modified test [i.e., $(p-1)/\min_gN_g < k < 1 - p/\min_gN_g$].


Theorem 10.10.1. If $2p < N_g + 1$, $g = 1,\ldots,q$, then the likelihood ratio
test and the modified likelihood ratio test of the null hypothesis (1) are admissible.


Now consider the hypothesis

$$
\mu^{(1)} = \cdots = \mu^{(q)},\qquad \Sigma_1 = \cdots = \Sigma_q.
\tag{8}
$$

The alternative hypothesis has been treated before. For the null hypothesis
let

$$
[\mu, \Sigma] = \big[(I + CC')^{-1}Cy,\ (I + CC')^{-1}\big],
\tag{9}
$$

where the $p \times r$ matrix $C$ has the density proportional to $|I + CC'|^{-\frac12(N-1)}$
and the $r$-component vector $y$ has the conditional normal distribution with
mean 0 and covariance matrix $(1/N)[I - C'(I + CC')^{-1}C]^{-1}$ given $C$. Then
the Bayes procedure is to reject the null hypothesis (8) if

$$
\frac{\big|\sum_{g=1}^qA_g + \sum_{g=1}^qN_g(\bar x^{(g)} - \bar x)(\bar x^{(g)} - \bar x)'\big|^{\frac12 r}}{\prod_{g=1}^q|A_g|^{\frac12 r_g}} \ge c.
\tag{10}
$$

If $2p < N_g + 1$, $g = 1,\ldots,q$, the prior distribution can be modified as before
to obtain the likelihood ratio test and modified likelihood ratio test.


Theorem 10.10.2. If $2p < N_g + 1$, $g = 1,\ldots,q$, the likelihood ratio test
and modified likelihood ratio test of the null hypothesis (8) are admissible.


For more details see Kiefer and Schwartz (1965). 


10.11. ELLIPTICALLY CONTOURED DISTRIBUTIONS 


10.11.1. Observations Elliptically Contoured 


Let $x_\alpha^{(g)}$, $\alpha = 1,\ldots,N_g$, be $N_g$ observations on $X^{(g)}$ having the density

$$
|\Lambda_g|^{-\frac12}\,g\big[(x - \nu^{(g)})'\Lambda_g^{-1}(x - \nu^{(g)})\big],
\tag{1}
$$

where $\mathcal{E}[(X - \nu^{(g)})'\Lambda_g^{-1}(X - \nu^{(g)})]^2 = \mathcal{E}R^4 < \infty$, $g = 1,\ldots,q$. Note that the
same function $g(\cdot)$ is used for the density in all $q$ populations. Define $N$, $A_g$,


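$g = 1,\ldots,q$, and $A$ by (1) of Section 10.2. Let $S_g = (1/n_g)A_g$, where $n_g =
N_g - 1$, and $S = (1/n)A$, where $n = \sum_{g=1}^qn_g$.

Since the likelihood ratio criterion $\lambda_1$ is invariant under the transformation
$X^{*(g)} = CX^{(g)} + \nu^{(g)}$, under the null hypothesis we can take $\Sigma_1 = \cdots = \Sigma_q
= I$ and $\mu^{(1)} = \cdots = \mu^{(q)} = 0$. Then

$$
\begin{aligned}
-2\log\lambda_1 &= -\Big[\sum_{g=1}^qN_g\log|\hat\Sigma_{g\Omega}| - N\log|\hat\Sigma_\omega|\Big] \\
&= -\Big[\sum_{g=1}^qN_g\log|I + (\hat\Sigma_{g\Omega} - I)| - N\log|I + (\hat\Sigma_\omega - I)|\Big] \\
&= -\Big\{\sum_{g=1}^qN_g\big[\operatorname{tr}(\hat\Sigma_{g\Omega} - I) - \tfrac12\operatorname{tr}(\hat\Sigma_{g\Omega} - I)^2 + O_p(N^{-3/2})\big] \\
&\qquad\quad - N\big[\operatorname{tr}(\hat\Sigma_\omega - I) - \tfrac12\operatorname{tr}(\hat\Sigma_\omega - I)^2 + O_p(N^{-3/2})\big]\Big\} \\
&= \tfrac12\Big[\sum_{g=1}^qN_g\operatorname{tr}(\hat\Sigma_{g\Omega} - I)^2 - N\operatorname{tr}(\hat\Sigma_\omega - I)^2\Big] + O_p(N^{-1/2}) \\
&= \tfrac12\Big[\sum_{g=1}^qN_g[\operatorname{vec}(\hat\Sigma_{g\Omega} - I)]'\operatorname{vec}(\hat\Sigma_{g\Omega} - I)
 - N[\operatorname{vec}(\hat\Sigma_\omega - I)]'\operatorname{vec}(\hat\Sigma_\omega - I)\Big] + O_p(N^{-1/2}).
\end{aligned}
\tag{2}
$$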
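By Theorem 3.6.2

$$
\sqrt{N_g}\,\operatorname{vec}(S_g - I) \xrightarrow{d} N\big[0,\ (\kappa+1)(I_{p^2} + K_{pp}) + \kappa\operatorname{vec}I_p(\operatorname{vec}I_p)'\big],
\tag{3}
$$

and $n_gS_g = N_g\hat\Sigma_{g\Omega}$, $g = 1,\ldots,q$, are independent. Let $N_g = k_gN$, $g = 1,\ldots,q$,
$\sum_{g=1}^qk_g = 1$, and let $N \to \infty$. In terms of this asymptotic theory the limiting
distribution of $\operatorname{vec}(S_1 - I),\ldots,\operatorname{vec}(S_q - I)$ is the same as the distribution of
$y^{(1)},\ldots,y^{(q)}$ of Section 8.8, with $\Sigma$ of Section 8.8 replaced by $(\kappa+1)(I_{p^2} +
K_{pp}) + \kappa\operatorname{vec}I_p(\operatorname{vec}I_p)'$.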


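When $\Sigma = I$, the variance of the limiting distribution of $\sqrt{N_g}(s_{ii}^{(g)} - 1)$ is
$3\kappa + 2$; the covariance of the limiting distribution of $\sqrt{N_g}(s_{ii}^{(g)} - 1)$ and
$\sqrt{N_g}(s_{jj}^{(g)} - 1)$, $i \ne j$, is $\kappa$; the variance of $\sqrt{N_g}\,s_{ij}^{(g)}$, $i \ne j$, is $\kappa + 1$; the set
$\sqrt{N_g}(s_{11}^{(g)} - 1),\ldots,\sqrt{N_g}(s_{pp}^{(g)} - 1)$ is independent of the set $(s_{ij}^{(g)})$, $i \ne j$; and the
$s_{ij}^{(g)}$, $i < j$, are mutually uncorrelated (as in Section 7.9.1).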


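Let $y_g = \operatorname{vec}(\hat\Sigma_{g\Omega} - I)$ and $\bar y = \operatorname{vec}(\hat\Sigma_\omega - I)$. Then $\bar y = \sum_{g=1}^q(N_g/N)y_g$ and

$$
-2\log\lambda_1 = \tfrac12\sum_{g=1}^qN_g(y_g - \bar y)'(y_g - \bar y) + O_p(N^{-1/2})
= \tfrac12\Big[\sum_{g=1}^qN_gy_g'y_g - N\bar y'\bar y\Big] + O_p(N^{-1/2}).
\tag{4}
$$

Let $Q$ be a $q \times q$ orthogonal matrix with last column $(\sqrt{N_1/N},\ldots,\sqrt{N_q/N})'$.
Define

$$
(w_1,\ldots,w_q) = (\sqrt{N_1}\,y_1,\ldots,\sqrt{N_q}\,y_q)\,Q.
\tag{5}
$$

Then $w_q = \sqrt{N}\,\bar y$ and

$$
\sum_{g=1}^qN_gy_g'y_g - N\bar y'\bar y = \sum_{g=1}^{q-1}w_g'w_g.
\tag{6}
$$

In these terms

$$
-2\log\lambda_1 = \tfrac12\sum_{g=1}^{q-1}w_g'w_g + O_p(N^{-1/2}),
\tag{7}
$$

and $w_1,\ldots,w_{q-1}$ are asymptotically independent, $w_g$ having the covariance
matrix of $\sqrt{N_g}\,y_g$, that is, $(\kappa+1)(I_{p^2} + K_{pp}) + \kappa\operatorname{vec}I_p(\operatorname{vec}I_p)'$. Then $w_g'w_g =
\sum_{i,j=1}^p(w_{ij}^{(g)})^2 = \sum_{i=1}^p(w_{ii}^{(g)})^2 + 2\sum_{i<j}(w_{ij}^{(g)})^2$. The covariance matrix of
$w_{11}^{(g)},\ldots,w_{pp}^{(g)}$ is $2(\kappa+1)I_p + \kappa\varepsilon\varepsilon'$, where $\varepsilon = (1,\ldots,1)'$. The characteristic
roots of this matrix are $2(\kappa+1)$ of multiplicity $p - 1$ and a single root of
$2(\kappa+1) + p\kappa$. Thus $\sum_{i=1}^p(w_{ii}^{(g)})^2$ has the distribution of $2(\kappa+1)\chi_{p-1}^2 +
[2(\kappa+1) + p\kappa]\chi_1^2$. The distribution of $2\sum_{i<j}(w_{ij}^{(g)})^2$ is $2(\kappa+1)\chi_{p(p-1)/2}^2$.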

Theorem 10.11.1. When sampling from (1) and the null hypothesis is true,

$$
-2\log\lambda_1 \xrightarrow{d} (\kappa+1)\chi^2_{(q-1)(p-1)(p+2)/2} + [(\kappa+1) + p\kappa/2]\chi^2_{q-1}.
\tag{8}
$$

When $\kappa = 0$, $-2\log\lambda_1 \xrightarrow{d} \chi^2_{(q-1)p(p+1)/2}$, in agreement with (12) of
Section 10.5. The validity of the distributions derived in Section 10.4 depends
on the observations being normally distributed; Theorem 10.11.1 shows that
even the asymptotic theory depends on nonnormality.
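
Two ingredients of the theorem lend themselves to a numerical check: the characteristic roots of $2(\kappa+1)I_p + \kappa\varepsilon\varepsilon'$, and the fact that at $\kappa = 0$ the degrees of freedom add up to $(q-1)p(p+1)/2$. The sketch below computes both; the function names are hypothetical, and the mean of the limiting distribution is used only as a convenient summary (the mean of $\chi_k^2$ is $k$).

```python
import numpy as np

def diag_block_roots(p, kappa):
    """Characteristic roots of 2(kappa + 1) I_p + kappa * ee', in ascending order."""
    e = np.ones((p, 1))
    V = 2.0 * (kappa + 1.0) * np.eye(p) + kappa * (e @ e.T)
    return np.linalg.eigvalsh(V)

def limiting_mean(p, q, kappa):
    """Mean of the limiting distribution of -2 log(lambda_1)."""
    df1 = (q - 1) * (p - 1) * (p + 2) / 2
    df2 = q - 1
    return (kappa + 1.0) * df1 + ((kappa + 1.0) + p * kappa / 2.0) * df2

print(diag_block_roots(4, 0.5))      # p - 1 roots 2(kappa+1), one root 2(kappa+1) + p*kappa
print(limiting_mean(3, 4, 0.0))      # equals (q-1) p (p+1) / 2 when kappa = 0
```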


The likelihood criterion for testing the null hypothesis (2) of Section 10.3 is
the product $\lambda_1\lambda_2$ or $V_1V_2$. Lemma 10.4.1 states that under normality $V_1$ and
$V_2$ (or equivalently $\lambda_1$ and $\lambda_2$) are independent. In the elliptically contoured
case we want to show that $\log V_1$ and $\log V_2$ are asymptotically independent.


Lemma 10.11.1. Let $A_1 = n_1S_1$ and $A_2 = n_2S_2$ be defined by (2) of Section
10.2 with $\Sigma_1 = \Sigma_2 = I$. Then $A_1(A_1 + A_2)^{-1}$ and $A_1 + A_2$ are asymptotically
independent.


Proof. Let $(1/\sqrt{n_g})(A_g - n_gI) = W_g$, $g = 1, 2$. Then


(9) 

2 n nyn NyynyN 
n, | A,(A, +4,)°'- — 1] = haw, - m+ 0 1), 
Vv | a(n +42) ae ce ray Gentle + nC 
(10) 

= [on 
Then 
(11) & vec nh py ees is ie en 
2 Fh 22 
(nm, +n) (nm, +n) 


ie ae ee ee | er, = 
veClV ny +n it mtn, 2]] °° 


By application of Lemma 10.11.1 in succession to $A_1$ and $A_1 + A_2$, to $A_1 + A_2$ and $A_1 + A_2 + A_3$, etc., we establish that $A_1 A^{-1}, A_2 A^{-1}, \ldots, A_q A^{-1}$ are asymptotically independent of $A = A_1 + \cdots + A_q$. It follows that $V_1$ and $V_2$ are asymptotically independent.


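Theorem 10.11.2. When $\Sigma_1 = \cdots = \Sigma_q$ and $\mu^{(1)} = \cdots = \mu^{(q)}$,

(12)  $-2 \log \lambda_1 \lambda_2 = -2 \log \lambda_1 - 2 \log \lambda_2 \xrightarrow{d} (\kappa + 1)\chi^2_{(q-1)(p-1)(p+2)/2} + [(\kappa + 1) + p\kappa/2]\chi^2_{q-1} + \chi^2_{p(q-1)}.$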




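The hypothesis of sphericity is that $\Sigma = \sigma^2 I$ (or $\Lambda = \lambda I$). The criterion is $\lambda_1 \lambda_2$, where

(13)  $\lambda_1 = \frac{|A|^{N/2}}{\prod_{i=1}^{p} a_{ii}^{N/2}}, \qquad \lambda_2 = \frac{\prod_{i=1}^{p} a_{ii}^{N/2}}{(\operatorname{tr} A / p)^{pN/2}}.$

The first factor is the criterion for independence of the components of $X$, and the second is the criterion for the hypothesis that the variances of the components are equal. For the first we set $q = p$ and $p_i = 1$ in Theorem 9.10, and for the second we set $q = p$ and $p = 1$. Thus

(14)  $-2 \log(\lambda_1 \lambda_2) \xrightarrow{d} (1 + \kappa)\chi^2_{p(p-1)/2} + \tfrac{1}{2}(3\kappa + 2)\chi^2_{p-1}.$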


10.11.2. Elliptically Contoured Matrix Distributions 


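Consider the density

(15)  $\prod_{g=1}^{q} |\Lambda_g|^{-N_g/2}\, g\left[\operatorname{tr} \sum_{g=1}^{q} \Lambda_g^{-1} (X^{(g)} - \nu^{(g)} \varepsilon_{N_g}')(X^{(g)} - \nu^{(g)} \varepsilon_{N_g}')'\right]$
      $= \prod_{g=1}^{q} |\Lambda_g|^{-N_g/2}\, g\left[\operatorname{tr} \sum_{g=1}^{q} \Lambda_g^{-1} A_g + \sum_{g=1}^{q} N_g (\bar{x}^{(g)} - \nu^{(g)})' \Lambda_g^{-1} (\bar{x}^{(g)} - \nu^{(g)})\right].$

In this density $(A_g, \bar{x}^{(g)})$, $g = 1, \ldots, q$, is a sufficient set of statistics, and the likelihood ratio criterion is (8) of Section 10.2, the same as for normality [Anderson and Fang (1990b)].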


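Theorem 10.11.3. Let $f(X)$ be a vector-valued function of $X = (X^{(1)}, \ldots, X^{(q)})$ $(p \times N)$ such that

(16)  $f(X^{(1)} + \nu^{(1)} \varepsilon_{N_1}', \ldots, X^{(q)} + \nu^{(q)} \varepsilon_{N_q}') = f(X^{(1)}, \ldots, X^{(q)})$

for every $(\nu^{(1)}, \ldots, \nu^{(q)})$ and

(17)  $f(CX^{(1)}, \ldots, CX^{(q)}) = f(X^{(1)}, \ldots, X^{(q)})$

for every nonsingular $C$. Then the distribution of $f(X)$ where $X$ has the arbitrary density (15) with $\Lambda_1 = \cdots = \Lambda_q$ is the same as the distribution of $f(X)$ where $X$ has the normal density (15).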



The proof of Theorem 10.11.3 is similar to the proof of Theorem 4.5.4. The theorem implies that the distribution of the criterion $V_1$ of (10) of Section 10.2 when the density of $X$ is (15) with $\Lambda_1 = \cdots = \Lambda_q$ is the same as for normality. Hence the distributions and their asymptotic expansions are those discussed in Sections 10.4 and 10.5.


Corollary 10.11.1. Let $f(X)$ be a vector-valued function of $X$ $(p \times N)$ such that

(18)  $f(X + \nu \varepsilon_N') = f(X)$

for every $\nu$ and (17) holds. Then the distribution of $f(X)$, where $X$ has the arbitrary density (15) with $\Lambda_1 = \cdots = \Lambda_q$ and $\nu^{(1)} = \cdots = \nu^{(q)}$, is the same as the distribution of $f(X)$, where $X$ has the normal density (15).

It follows that the distribution of the criterion $\lambda$ of (7) or $V$ of (11) of Section 10.3 is the same for the density (15) as for $X$ being normally distributed.


Let $X$ $(p \times N)$ have the density

(19)  $|\Lambda|^{-N/2}\, g[\operatorname{tr} \Lambda^{-1}(X - \nu \varepsilon_N')(X - \nu \varepsilon_N')'].$

Then the likelihood ratio criterion for testing the null hypothesis $\Lambda = \lambda I$ for some $\lambda > 0$ is (7) of Section 10.7, and its distribution under the null hypothesis is the same as for $X$ being normally distributed.

For more detail see Anderson and Fang (1990b) and Fang and Zhang (1990).


PROBLEMS 


10.1. (Sec. 10.2) Sums of squares and cross-products of deviations from the means of four measurements are given below (from Table 3.4). The populations are Iris versicolor (1), Iris setosa (2), and Iris virginica (3); each sample consists of 50 observations:

$A_1 = \begin{pmatrix} 13.0552 & 4.1740 & 8.9620 & 2.7332 \\ 4.1740 & 4.8250 & 4.0500 & 2.0190 \\ 8.9620 & 4.0500 & 10.8200 & 3.5820 \\ 2.7332 & 2.0190 & 3.5820 & 1.9162 \end{pmatrix},$

$A_2 = \begin{pmatrix} 6.0882 & 4.8616 & 0.8014 & 0.5062 \\ 4.8616 & 7.0408 & 0.5732 & 0.4556 \\ 0.8014 & 0.5732 & 1.4778 & 0.2974 \\ 0.5062 & 0.4556 & 0.2974 & 0.5442 \end{pmatrix},$

$A_3 = \begin{pmatrix} 19.8128 & 4.5944 & 14.8612 & 2.4056 \\ 4.5944 & 5.0962 & 3.4976 & 2.3338 \\ 14.8612 & 3.4976 & 14.9248 & 2.3924 \\ 2.4056 & 2.3338 & 2.3924 & 3.6962 \end{pmatrix}.$

(a) Test the hypothesis $\Sigma_1 = \Sigma_2$ at the 5% significance level.
(b) Test the hypothesis $\Sigma_1 = \Sigma_2 = \Sigma_3$ at the 5% significance level.


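10.2. (Sec. 10.2)

(a) Let $Y^{(g)}$, $g = 1, \ldots, q$, be a set of random vectors each with $p$ components. Suppose

$\mathcal{E} Y^{(g)} = 0, \qquad \mathcal{E} Y^{(g)} Y^{(h)'} = \delta_{gh} \Sigma_g.$

Let $C$ be an orthogonal matrix of order $q$ such that each element of the last row is

$c_{qh} = 1/\sqrt{q}.$

Define

$Z^{(g)} = \sum_{h=1}^{q} c_{gh} Y^{(h)}, \qquad g = 1, \ldots, q.$

Show that

$\mathcal{E} Z^{(q)} Z^{(g)'} = 0, \qquad g = 1, \ldots, q - 1,$

if and only if

$\Sigma_1 = \Sigma_2 = \cdots = \Sigma_q.$

(b) Let $X_\alpha^{(g)}$, $\alpha = 1, \ldots, N$, be a random sample from $N(\mu^{(g)}, \Sigma_g)$, $g = 1, \ldots, q$. Use the result from (a) to construct a test of the hypothesis

$H: \Sigma_1 = \cdots = \Sigma_q$

based on a test of independence of $Z^{(q)}$ and the set $Z^{(1)}, \ldots, Z^{(q-1)}$. Find the exact distribution of the criterion for the case $p = 2$.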


10.3. (Sec. 10.2) Unbiasedness of the modified likelihood ratio test of $\sigma_1^2 = \sigma_2^2$. Show that (14) is unbiased. [Hint: Let $G = n_1 F / n_2$, $r = \sigma_1^2 / \sigma_2^2$, and $c_1 < c_2$ be the solutions to $G^{n_1/2}(1 + G)^{-(n_1 + n_2)/2} = k$, the critical value for the modified likelihood ratio criterion. Then

$\Pr\{\text{Acceptance} \mid \sigma_1^2 / \sigma_2^2 = r\} = \text{const} \int_{c_1}^{c_2} r^{n_1/2}\, G^{n_1/2 - 1} (1 + rG)^{-(n_1 + n_2)/2}\, dG = \text{const} \int_{rc_1}^{rc_2} H^{n_1/2 - 1} (1 + H)^{-(n_1 + n_2)/2}\, dH.$

Show that the derivative of the above with respect to $r$ is positive for $0 < r < 1$, 0 for $r = 1$, and negative for $r > 1$.]




10.4. (Sec. 10.2) Prove that the limiting distribution of (19) is $\chi^2_f$, where $f = \tfrac{1}{2} p(p + 1)(q - 1)$. [Hint: Let $\Sigma = I$. Show that the limiting distribution of (19) is the limiting distribution of

$\tfrac{1}{2} \sum_{i=1}^{p} \sum_{g=1}^{q} n_g (s_{ii}^{(g)} - s_{ii})^2 + \sum_{i<j} \sum_{g=1}^{q} n_g (s_{ij}^{(g)} - s_{ij})^2,$

where $S^{(g)} = (s_{ij}^{(g)})$, $S = (s_{ij})$, and the $\sqrt{n_g}\,(s_{ij}^{(g)} - \delta_{ij})$, $i \le j$, are independent in the limiting distribution, the limiting distribution of $\sqrt{n_g}\,(s_{ii}^{(g)} - 1)$ is $N(0, 2)$, and the limiting distribution of $\sqrt{n_g}\, s_{ij}^{(g)}$, $i < j$, is $N(0, 1)$.]


10.5. (Sec. 10.4) Prove (15) by integration of Wishart densities. [Hint: $\mathcal{E} V_1^h = \mathcal{E} \prod_{g=1}^{q} |A_g|^{n_g h/2} |A|^{-nh/2}$ can be written as the integral of a constant times $|A|^{-nh/2} \prod_{g=1}^{q} w(A_g \mid \Sigma, n_g + h n_g)$. Integration over $\sum_{g=1}^{q} A_g = A$ gives a constant times $w(A \mid \Sigma, n + hn)$.]

10.6. (Sec. 10.4) Prove (16) by integration of Wishart and normal densities. [Hint: $\sum_{g=1}^{q} N_g (\bar{x}^{(g)} - \bar{x})(\bar{x}^{(g)} - \bar{x})'$ is distributed as $\sum_{f=1}^{q-1} y_f y_f'$. Use the hint of Problem 10.5.]


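10.7. (Sec. 10.6) Let $x_1^{(\nu)}, \ldots, x_N^{(\nu)}$ be observations from $N(\mu^{(\nu)}, \Sigma_\nu)$, $\nu = 1, 2$, and let $A_\nu = \sum_\alpha (x_\alpha^{(\nu)} - \bar{x}^{(\nu)})(x_\alpha^{(\nu)} - \bar{x}^{(\nu)})'$.

(a) Prove that the likelihood ratio test for $H: \Sigma_1 = \Sigma_2$ is equivalent to rejecting $H$ if

$T = \frac{|A_1| \cdot |A_2|}{|A_1 + A_2|^2} \le c.$

(b) Let $d_1^2, d_2^2, \ldots, d_p^2$ be the roots of $|\Sigma_1 - \lambda \Sigma_2| = 0$, and let

$D = \begin{pmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & d_p \end{pmatrix}.$

Show that $T$ is distributed as $|B_1| \cdot |B_2| / |B_1 + B_2|^2$, where $B_1$ is distributed according to $W(D^2, N - 1)$ and $B_2$ is distributed according to $W(I, N - 1)$. Show that $T$ is distributed as $|DC_1D| \cdot |C_2| / |DC_1D + C_2|^2$, where $C_i$ is distributed according to $W(I, N - 1)$.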
10.8. (Sec. 10.6) For p=2 show 
Pr{V, <v} =/,(n, — 1,2 -1) 
+ Bo'(n,-1,n.- pitta B/m fPg-2m/ncy =e) ae 
a 


+1-I,(n,—1,2,-1), 




where a <b are the two roots of xf(1—x,)" =o <niint /n". [Hini: This 
follows from integrating the density defined by (8).] 


10.9. (Sec. 10.6) For $p = 2$ and $n_1 = n_2 = m$, say, show


Pr{V, <0} 


=21,(m—- 1l,m— 1)+2B (nm - lin Teli lag go Ee, 
I—-vVi-devr® 


where a4 =3[1 — V1 —40!/™]. 


10.10. (Sec. 10.7) Find the distribution of $W$ for $p = 2$ under the null hypothesis (a) directly from the distribution of $A$ and (b) from the distribution of the characteristic roots (Chapter 13).


10.11. (Sec. 10.7) Let $x_1, \ldots, x_N$ be a sample from $N(\mu, \Sigma)$. What is the likelihood ratio criterion for testing the hypothesis $\mu = k\mu_0$, $\Sigma = k^2 \Sigma_0$, where $\mu_0$ and $\Sigma_0$ are specified and $k$ is unspecified?

10.12. (Sec. 10.7) Let $x_1^{(1)}, \ldots, x_{N_1}^{(1)}$ be a sample from $N(\mu^{(1)}, \Sigma_1)$, and $x_1^{(2)}, \ldots, x_{N_2}^{(2)}$ be a sample from $N(\mu^{(2)}, \Sigma_2)$. What is the likelihood ratio criterion for testing the hypothesis that $\Sigma_1 = k^2 \Sigma_2$, where $k$ is unspecified? What is the likelihood ratio criterion for testing the hypothesis that $\mu^{(1)} = k\mu^{(2)}$ and $\Sigma_1 = k^2 \Sigma_2$, where $k$ is unspecified?


10.13. (Sec. 10.7) Let $x_\alpha$ of $p$ components, $\alpha = 1, \ldots, N$, be observations from $N(\mu, \Sigma)$. We define the following hypotheses:

$H_1: \Sigma = k^2 \Sigma_0,$

$H_2: \mu = 0$, given that $\Sigma = k^2 \Sigma_0$.

In each case $k^2$ is unspecified, but $\Sigma_0$ is specified. Find the likelihood ratio criterion $\lambda_2$ for testing $H_2$. Give the asymptotic distribution of $-2 \log \lambda_2$ under $H_2$. Obtain the exact distribution of a suitable monotonic function of $\lambda_2$ under $H_2$.

10.14. (Sec. 10.7) Find the likelihood ratio criterion $\lambda$ for testing $H$ of Problem 10.13 (given $x_1, \ldots, x_N$). What is the asymptotic distribution of $-2 \log \lambda$ under $H$?

10.15. (Sec. 10.7) Show that $\lambda = \lambda_1 \lambda_2$, where $\lambda$ is defined in Problem 10.14, $\lambda_2$ is defined in Problem 10.13, and $\lambda_1$ is the likelihood ratio criterion for $H_1$ in Problem 10.13. Are $\lambda_1$ and $\lambda_2$ independently distributed under $H$? Prove your answer.


10.16. (Sec. 10.7) Verify that $\operatorname{tr} B\Psi^{-1}$ has the $\chi^2$-distribution with $p(N - 1)$ degrees of freedom.

10.17. (Sec. 10.7.1) Admissibility of sphericity test. Prove that the likelihood ratio test of sphericity is admissible. [Hint: Under the null hypothesis let P={1/+ 7°)If. and let 7 have the density (1+ 7)- (y2)?7 3]

10.18. (Sec. 10.10.1) Show that for r>p

ne es ye ! lz ~ ae Dea i=] r [ Jax, <0 i=] r+ Dx r

if2p-—1<1+r+p<n+1. [Hint: $|A| / |I + A| \le 1$ if $A$ is positive semidefinite. Also, $|\sum_{i=1}^{r} x_i x_i'|$ has the distribution of $\chi^2_r \chi^2_{r-1} \cdots \chi^2_{r-p+1}$ if $x_1, \ldots, x_r$ are independently distributed according to $N(0, I)$.]

10.19. (Sec. 10.10.1) Show

[ sei [CO‘|# e- OAC de = const| Al 274",

where $C$ is $p \times r$. [Hint: $CC'$ has the distribution $W(A^{-1}, r)$ if $C$ has a density proportional to $e^{-\frac{1}{2} \operatorname{tr} C'AC}$.]

10.20. (Sec. 10.10.1) Using Problem 10.18, complete the proof of Theorem 10.10.1.


CHAPTER 11 


Principal Components 


11.1. INTRODUCTION 


Principal components are linear combinations of random or statistical vari- 
ables which have special properties in terms of variances. For example, the 
first principal component is the normalized linear combination (the sum of 
squares of the coefficients being one) with maximum variance. In effect, 
transforming the original vector variable to the vector of principal components
amounts to a rotation of coordinate axes to a new coordinate system
that has inherent statistical properties. This choosing of a coordinate system
is to be contrasted with the many problems treated previously where the 
coordinate system is irrelevant. 

The principal components turn out to be the characteristic vectors of the 
covariance matrix. Thus the study of principal components can be considered 
as putting into statistical terms the usual developments of characteristic roots 
and vectors (for positive semidefinite matrices). 

From the point of view of statistical theory, the set of principal compo- 
nents yields a convenient set of coordinates, and the accompanying variances 
of the components characterize their statistical properties. In statistical 
practice, the method of principal components is used to find the linear 
combinations with large variance. In many exploratory studies the number of 
variables under consideration is too large to handle. Since it is the deviations 
in these studies that are of interest, a way of reducing the number of 
variables to be treated is to discard the linear combinations which have small 
variances and study only those with large variances. For example, a physical 
anthropologist may make dozens of measurements of lengths and breadths of 


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson 
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc. 




each of a number of individuals, such measurements as ear length, ear 
breadth, facial length, facial breadth, and so forth. He may be interested in 
describing and analyzing how individuals differ in these kinds of physiological 
characteristics. Eventually he will want to explain these differences, but first 
he wants to know what measurements or combinations of measurements 
show considerable variation; that is, which should have further study. The 
principal components give a new set of linearly combined measurements. It 
may be that most of the variation from individual to individual resides 
in three linear combinations; then the anthropologist can direct his study to 
these three quantities; the other linear combinations vary so little from one 
person to the next that study of them will tell little of individual variation. 

Hotelling (1933), who developed many of these ideas, gave a rather 
thorough discussion. 

In Section 11.2 we define principal components in the population to have 
the properties described above; they define an orthogonal transformation to 
a diagonal covariance matrix. The maximum likelihood estimators have 
similar properties in the sample (Section 11.3). A brief discussion of compu- 
tation is given in Section 11.4, and a numerical example is carried out in 
Section 11.5. Asymptotic distributions of the coefficients of the sample 
principal components and the sample variances are derived and applied to 
obtain large-sample tests and confidence intervals for individual parameters 
(Section 11.6); exact confidence bounds are found for the characteristic roots 
of a covariance matrix. In Section 11.7 we consider other tests of hypotheses 
about these roots. 


11.2. DEFINITION OF PRINCIPAL COMPONENTS 
IN THE POPULATION 


Suppose the random vector $X$ of $p$ components has the covariance matrix $\Sigma$. Since we shall be interested only in variances and covariances in this chapter, we shall assume that the mean vector is 0. Moreover, in developing the ideas and algebra here, the actual distribution of $X$ is irrelevant except for the covariance matrix; however, if $X$ is normally distributed, more meaning can be given to the principal components.

In the following treatment we shall not use the usual theory of characteristic roots and vectors; as a matter of fact, that theory will be derived implicitly. The treatment will include the cases where $\Sigma$ is singular (i.e., positive semidefinite) and where $\Sigma$ has multiple roots.

Let $\beta$ be a $p$-component column vector such that $\beta'\beta = 1$. The variance of $\beta'X$ is

(1)  $\mathcal{E}(\beta'X)^2 = \mathcal{E}\beta'XX'\beta = \beta'\Sigma\beta.$




To determine the normalized linear combination $\beta'X$ with maximum variance, we must find a vector $\beta$ satisfying $\beta'\beta = 1$ which maximizes (1). Let

(2)  $\phi = \beta'\Sigma\beta - \lambda(\beta'\beta - 1) = \sum_{i,j} \beta_i \sigma_{ij} \beta_j - \lambda\left(\sum_i \beta_i^2 - 1\right),$

where $\lambda$ is a Lagrange multiplier. The vector of partial derivatives $(\partial\phi / \partial\beta_i)$ is

(3)  $\frac{\partial\phi}{\partial\beta} = 2\Sigma\beta - 2\lambda\beta$

(by Theorem A.4.3 of the Appendix). Since $\beta'\Sigma\beta$ and $\beta'\beta$ have derivatives everywhere in a region containing $\beta'\beta = 1$, a vector $\beta$ maximizing $\beta'\Sigma\beta$ must satisfy the expression (3) set equal to 0; that is,

(4)  $(\Sigma - \lambda I)\beta = 0.$

In order to get a solution of (4) with $\beta'\beta = 1$ we must have $\Sigma - \lambda I$ singular; in other words, $\lambda$ must satisfy

(5)  $|\Sigma - \lambda I| = 0.$

The function $|\Sigma - \lambda I|$ is a polynomial in $\lambda$ of degree $p$. Therefore (5) has $p$ roots; let these be $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$. [$\beta'$ complex conjugate in (6) proves $\lambda$ real.] If we multiply (4) on the left by $\beta'$, we obtain

(6)  $\beta'\Sigma\beta = \lambda\beta'\beta = \lambda.$

This shows that if $\beta$ satisfies (4) (and $\beta'\beta = 1$), then the variance of $\beta'X$ [given by (1)] is $\lambda$. Thus for the maximum variance we should use in (4) the largest root $\lambda_1$. Let $\beta^{(1)}$ be a normalized solution of $(\Sigma - \lambda_1 I)\beta = 0$. Then $U_1 = \beta^{(1)'}X$ is a normalized linear combination with maximum variance. [If $\Sigma - \lambda_1 I$ is of rank $p - 1$, then there is only one solution to $(\Sigma - \lambda_1 I)\beta = 0$ and $\beta'\beta = 1$.]

Now let us find a normalized combination $\beta'X$ that has maximum variance of all linear combinations uncorrelated with $U_1$. Lack of correlation means

(7)  $0 = \mathcal{E}\beta'XU_1 = \mathcal{E}\beta'XX'\beta^{(1)} = \beta'\Sigma\beta^{(1)} = \lambda_1\beta'\beta^{(1)},$

since $\Sigma\beta^{(1)} = \lambda_1\beta^{(1)}$. Thus $\beta'X$ is orthogonal to $U_1$ in both the statistical sense (of lack of correlation) and the geometric sense (of the inner product of the vectors $\beta$ and $\beta^{(1)}$ being zero). (That is, $\lambda_1\beta'\beta^{(1)} = 0$ only if $\beta'\beta^{(1)} = 0$ when $\lambda_1 \ne 0$, and $\lambda_1 \ne 0$ if $\Sigma \ne 0$; the case of $\Sigma = 0$ is trivial and is not treated.)




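We now want to maximize

(8)  $\phi_2 = \beta'\Sigma\beta - \lambda(\beta'\beta - 1) - 2\nu_1\beta'\Sigma\beta^{(1)},$

where $\lambda$ and $\nu_1$ are Lagrange multipliers. The vector of partial derivatives is

(9)  $\frac{\partial\phi_2}{\partial\beta} = 2\Sigma\beta - 2\lambda\beta - 2\nu_1\Sigma\beta^{(1)},$

and we set this equal to 0. From (9) we obtain by multiplying on the left by $\beta^{(1)'}$

(10)  $0 = 2\beta^{(1)'}\Sigma\beta - 2\lambda\beta^{(1)'}\beta - 2\nu_1\beta^{(1)'}\Sigma\beta^{(1)} = -2\nu_1\lambda_1$

by (7). Therefore, $\nu_1 = 0$ and $\beta$ must satisfy (4), and therefore $\lambda$ must satisfy (5). Let $\lambda_{(2)}$ be the maximum of $\lambda_1, \ldots, \lambda_p$ such that there is a vector $\beta$ satisfying $(\Sigma - \lambda_{(2)} I)\beta = 0$, $\beta'\beta = 1$, and (7); call this vector $\beta^{(2)}$ and the corresponding linear combination $U_2 = \beta^{(2)'}X$. (It will be shown eventually that $\lambda_{(2)} = \lambda_2$. We define $\lambda_{(1)} = \lambda_1$.)

This procedure is continued; at the $(r + 1)$st step, we want to find a vector $\beta$ such that $\beta'X$ has maximum variance of all normalized linear combinations which are uncorrelated with $U_1, \ldots, U_r$, that is, such that

(11)  $0 = \mathcal{E}\beta'XU_i = \mathcal{E}\beta'XX'\beta^{(i)} = \beta'\Sigma\beta^{(i)} = \lambda_{(i)}\beta'\beta^{(i)}, \qquad i = 1, \ldots, r.$

We want to maximize

(12)  $\phi_{r+1} = \beta'\Sigma\beta - \lambda(\beta'\beta - 1) - 2\sum_{i=1}^{r} \nu_i\beta'\Sigma\beta^{(i)},$

where $\lambda$ and $\nu_1, \ldots, \nu_r$ are Lagrange multipliers. The vector of partial derivatives is

(13)  $\frac{\partial\phi_{r+1}}{\partial\beta} = 2\Sigma\beta - 2\lambda\beta - 2\sum_{i=1}^{r} \nu_i\Sigma\beta^{(i)},$

and we set this equal to 0. Multiplying (13) on the left by $\beta^{(j)'}$, we obtain

(14)  $0 = 2\beta^{(j)'}\Sigma\beta - 2\lambda\beta^{(j)'}\beta - 2\nu_j\beta^{(j)'}\Sigma\beta^{(j)}.$

If $\lambda_{(j)} \ne 0$, this gives $-2\nu_j\lambda_{(j)} = 0$ and $\nu_j = 0$. If $\lambda_{(j)} = 0$, then $\Sigma\beta^{(j)} = \lambda_{(j)}\beta^{(j)} = 0$ and the $j$th term in the sum in (13) vanishes. Thus $\beta$ must satisfy (4), and therefore $\lambda$ must satisfy (5).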



Let $\lambda_{(r+1)}$ be the maximum of $\lambda_1, \ldots, \lambda_p$ such that there is a vector $\beta$ satisfying $(\Sigma - \lambda_{(r+1)} I)\beta = 0$, $\beta'\beta = 1$, and (11); call this vector $\beta^{(r+1)}$, and the corresponding linear combination $U_{r+1} = \beta^{(r+1)'}X$. If $\lambda_{(r+1)} = 0$ and $\lambda_{(j)} = 0$, $j \ne r + 1$, then $\beta^{(j)'}\Sigma\beta^{(r+1)} = 0$ does not imply $\beta^{(j)'}\beta^{(r+1)} = 0$. However, $\beta^{(r+1)}$ can be replaced by a linear combination of $\beta^{(r+1)}$ and the $\beta^{(j)}$'s with $\lambda_{(j)}$'s being 0, so that the new $\beta^{(r+1)}$ is orthogonal to all $\beta^{(j)}$, $j = 1, \ldots, r$. This procedure is carried on until at the $(m + 1)$st stage one cannot find a vector $\beta$ satisfying $\beta'\beta = 1$, (4), and (11). Either $m = p$ or $m < p$ since $\beta^{(1)}, \ldots, \beta^{(m)}$ must be linearly independent.

We shall now show that the inequality $m < p$ leads to a contradiction. If $m < p$ there exist $p - m$ vectors, say $e_{m+1}, \ldots, e_p$, such that $\beta^{(i)'}e_j = 0$, $e_i'e_j = \delta_{ij}$. (This follows from Lemma A.4.2 in the Appendix.) Let $(e_{m+1}, \ldots, e_p) = E$. Now we shall show that there exists a $(p - m)$-component vector $c$ and a number $\theta$ such that $Ec = \sum c_i e_i$ is a solution to (4) with $\lambda = \theta$. Consider a root of $|E'\Sigma E - \theta I| = 0$ and a corresponding vector $c$ satisfying $E'\Sigma Ec = \theta c$. The vector $\Sigma Ec$ is orthogonal to $\beta^{(1)}, \ldots, \beta^{(m)}$ (since $\beta^{(i)'}\Sigma Ec = \lambda_{(i)}\beta^{(i)'}\sum_j c_j e_j = \lambda_{(i)}\sum_j c_j\beta^{(i)'}e_j = 0$) and therefore is a vector in the space spanned by $e_{m+1}, \ldots, e_p$ and can be written as $Eg$ [where $g$ is a $(p - m)$-component vector]. Multiplying $\Sigma Ec = Eg$ on the left by $E'$, we obtain $E'\Sigma Ec = E'Eg = g$. Thus $g = \theta c$, and we have $\Sigma(Ec) = \theta(Ec)$. Then $(Ec)'X$ is uncorrelated with $\beta^{(j)'}X$, $j = 1, \ldots, m$, and thus leads to a new $\beta^{(m+1)}$. Since this contradicts the assumption that $m < p$, we must have $m = p$.

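Let $B = (\beta^{(1)} \cdots \beta^{(p)})$ and

(15)  $\Lambda = \begin{pmatrix} \lambda_{(1)} & 0 & \cdots & 0 \\ 0 & \lambda_{(2)} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_{(p)} \end{pmatrix}.$

The equations $\Sigma\beta^{(r)} = \lambda_{(r)}\beta^{(r)}$ can be written in matrix form as

(16)  $\Sigma B = B\Lambda,$

and the equations $\beta^{(r)'}\beta^{(r)} = 1$ and $\beta^{(r)'}\beta^{(s)} = 0$, $r \ne s$, can be written as

(17)  $B'B = I.$

From (16) and (17) we obtain

(18)  $B'\Sigma B = \Lambda.$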




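From the fact that

(19)  $|\Sigma - \lambda I| = |B'| \cdot |\Sigma - \lambda I| \cdot |B| = |B'\Sigma B - \lambda B'B| = |\Lambda - \lambda I| = \prod_{i=1}^{p} (\lambda_{(i)} - \lambda),$

we see that the roots of (19) are the diagonal elements of $\Lambda$; that is, $\lambda_{(1)} = \lambda_1, \lambda_{(2)} = \lambda_2, \ldots, \lambda_{(p)} = \lambda_p$. We have proved the following theorem:

Theorem 11.2.1. Let the $p$-component random vector $X$ have $\mathcal{E}X = 0$ and $\mathcal{E}XX' = \Sigma$. Then there exists an orthogonal linear transformation

(20)  $U = B'X$

such that the covariance matrix of $U$ is $\mathcal{E}UU' = \Lambda$ and

(21)  $\Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{pmatrix},$

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ are the roots of (5). The $r$th column of $B$, $\beta^{(r)}$, satisfies $(\Sigma - \lambda_r I)\beta^{(r)} = 0$. The $r$th component of $U$, $U_r = \beta^{(r)'}X$, has maximum variance of all normalized linear combinations uncorrelated with $U_1, \ldots, U_{r-1}$.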

The vector $U$ is defined as the vector of principal components of $X$. It will be observed that we have proved Theorem A.2.1 of Appendix A for $B$ positive semidefinite, and indeed, the proof holds for any symmetric $B$. It might be noted that once the transformation to $U_1, \ldots, U_p$ has been made, it is obvious that $U_1$ is the normalized linear combination with maximum variance, for if $U^* = \sum c_i U_i$, where $\sum c_i^2 = 1$ ($U^*$ also being a normalized linear combination of the $X$'s), then $\operatorname{Var}(U^*) = \sum c_i^2 \lambda_i = \lambda_1 + \sum_{i=2}^{p} c_i^2(\lambda_i - \lambda_1)$ (since $c_1^2 = 1 - \sum_{i=2}^{p} c_i^2$), which is clearly maximum for $c_i^2 = 0$, $i = 2, \ldots, p$. Similarly, $U_2$ is the normalized linear combination uncorrelated with $U_1$ which has maximum variance ($U^* = \sum c_i U_i$ being uncorrelated with $U_1$ implying $c_1 = 0$); in turn the maximal properties of $U_3, \ldots, U_p$ are verified.
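The content of Theorem 11.2.1 can be illustrated numerically. The sketch below is not the derivation used in the text: it assumes NumPy and uses its symmetric eigensolver `numpy.linalg.eigh`. The covariance matrix `sigma` and the function name `population_principal_components` are hypothetical choices made for illustration.

```python
import numpy as np

def population_principal_components(sigma):
    """Roots in descending order and the matching orthonormal vectors."""
    roots, vectors = np.linalg.eigh(sigma)      # eigh returns ascending roots
    order = np.argsort(roots)[::-1]
    return roots[order], vectors[:, order]

sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])             # illustrative covariance matrix
lam, B = population_principal_components(sigma)

# The variance beta' Sigma beta of random normalized combinations
# should never exceed the largest root lam[0].
rng = np.random.default_rng(0)
betas = rng.normal(size=(1000, 3))
betas /= np.linalg.norm(betas, axis=1, keepdims=True)
variances = np.einsum("ij,jk,ik->i", betas, sigma, betas)
print(lam, variances.max())
```

The columns of `B` play the role of $\beta^{(1)}, \ldots, \beta^{(p)}$, so `B.T @ sigma @ B` should reproduce the diagonal matrix of roots as in (18).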

Some other consequences can be derived. 


Corollary 11.2.1. Suppose $\lambda_{r+1} = \cdots = \lambda_{r+m} = \nu$ (i.e., $\nu$ is a root of multiplicity $m$); then $\Sigma - \nu I$ is of rank $p - m$. Furthermore $B^* = (\beta^{(r+1)} \cdots \beta^{(r+m)})$ is uniquely determined except for multiplication on the right by an orthogonal matrix.




Proof. From the derivation of the theorem we have $(\Sigma - \nu I)\beta^{(i)} = 0$, $i = r + 1, \ldots, r + m$; that is, $\beta^{(r+1)}, \ldots, \beta^{(r+m)}$ are $m$ linearly independent solutions of $(\Sigma - \nu I)\beta = 0$. To show that there cannot be another linearly independent solution, take $\sum_{i=1}^{p} x_i\beta^{(i)}$, where the $x_i$ are scalars. If it is a solution, we have $\nu\sum x_i\beta^{(i)} = \Sigma(\sum x_i\beta^{(i)}) = \sum x_i\Sigma\beta^{(i)} = \sum x_i\lambda_i\beta^{(i)}$. Since $\nu x_i = \lambda_i x_i$, we must have $x_i = 0$ unless $i = r + 1, \ldots, r + m$. Thus the rank is $p - m$.

If $B^*$ is one set of solutions to $(\Sigma - \nu I)\beta = 0$, then any other set of solutions are linear combinations of the others, that is, are $B^*A$ for $A$ nonsingular. However, the orthogonality conditions $B^{*'}B^* = I$ applied to the linear combinations give $I = (B^*A)'(B^*A) = A'B^{*'}B^*A = A'A$, and thus $A$ must be orthogonal. ∎


Theorem 11.2.2. An orthogonal transformation V = CX of a random vector 
X leaves invariant the generalized variance and the sum of the variances of the 
components. 


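Proof. Let $\mathcal{E}X = 0$ and $\mathcal{E}XX' = \Sigma$. Then $\mathcal{E}V = 0$ and $\mathcal{E}VV' = C\Sigma C'$. The generalized variance of $V$ is

(22)  $|C\Sigma C'| = |C| \cdot |\Sigma| \cdot |C'| = |\Sigma| \cdot |CC'| = |\Sigma|,$

which is the generalized variance of $X$. The sum of the variances of the components of $V$ is

(23)  $\sum_{i=1}^{p} \mathcal{E}V_i^2 = \operatorname{tr}(C\Sigma C') = \operatorname{tr}(\Sigma C'C) = \operatorname{tr}(\Sigma I) = \operatorname{tr}\Sigma = \sum_{i=1}^{p} \mathcal{E}X_i^2.$ ∎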


Corollary 11.2.2. The generalized variance of the vector of principal components is the generalized variance of the original vector, and the sum of the variances of the principal components is the sum of the variances of the original variates.
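A quick numerical check of Theorem 11.2.2 is sketched below, assuming NumPy. The matrix `sigma` is an illustrative covariance matrix, and the orthogonal matrix `C` is obtained as the Q factor of a random matrix; neither comes from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])             # illustrative covariance matrix

# The Q factor of a QR decomposition is orthogonal, so V = C X is an
# orthogonal transformation of X.
C, _ = np.linalg.qr(rng.normal(size=(3, 3)))
cov_v = C @ sigma @ C.T                         # covariance matrix of V

gen_var_x, gen_var_v = np.linalg.det(sigma), np.linalg.det(cov_v)
total_var_x, total_var_v = np.trace(sigma), np.trace(cov_v)
print(gen_var_x, gen_var_v)
print(total_var_x, total_var_v)
```

Both printed pairs should agree up to rounding, matching (22) and (23).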


Another approach to the above theory can be based on the surfaces of constant density of the normal distribution with mean vector 0 and covariance matrix $\Sigma$ (nonsingular). The density is

(24)  $\frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} x'\Sigma^{-1}x},$

and surfaces of constant density are ellipsoids

(25)  $x'\Sigma^{-1}x = C.$

A principal axis of this ellipsoid is defined as the line from $-y$ to $y$, where $y$ is a point on the ellipsoid where the squared distance $x'x$ has a stationary point. Using the method of Lagrange multipliers, we determine the stationary points by considering

(26)  $\psi = x'x - \lambda x'\Sigma^{-1}x,$

where $\lambda$ is a Lagrange multiplier. We differentiate $\psi$ with respect to the components of $x$, and the derivatives set equal to 0 are

(27)  $\frac{\partial\psi}{\partial x} = 2x - 2\lambda\Sigma^{-1}x = 0,$

or

(28)  $x = \lambda\Sigma^{-1}x.$

Multiplication by $\Sigma$ gives

(29)  $\Sigma x = \lambda x.$

This equation is the same as (4) and the same algebra can be developed. Thus the vectors $\beta^{(1)}, \ldots, \beta^{(p)}$ give the principal axes of the ellipsoid. The transformation $u = B'x$ is a rotation of the coordinate axes so that the new axes are in the direction of the principal axes of the ellipsoid. In the new coordinates the ellipsoid is

(30)  $\sum_{i=1}^{p} \frac{u_i^2}{\lambda_i} = C.$

Thus the length of the $i$th principal axis is $2\sqrt{\lambda_i C}$.

A third approach to the same results is in terms of planes of closest fit [Pearson (1901)]. Consider a plane through the origin, $\alpha'x = 0$, where $\alpha'\alpha = 1$. The distance of a point $x$ from this plane is $\alpha'x$. Let us find the coefficients of a plane such that the expected distance squared of a random point $X$ from the plane is a minimum, where $\mathcal{E}X = 0$ and $\mathcal{E}XX' = \Sigma$. Thus we wish to minimize $\mathcal{E}(\alpha'X)^2 = \mathcal{E}\alpha'XX'\alpha = \alpha'\Sigma\alpha$, subject to the restriction $\alpha'\alpha = 1$. Comparison with the first approach immediately shows that the solution is $\alpha = \beta^{(p)}$.

Analysis into principal components is most suitable when all the components of $X$ are measured in the same units. If they are not measured in the same units, the rationale of maximizing $\beta'\Sigma\beta$ relative to $\beta'\beta$ is questionable; in fact, the analysis will depend on the various units of measurement. Suppose $\Delta$ is a diagonal matrix, and let $Y = \Delta X$. For example, one component of $X$ may be measured in inches and the corresponding component of $Y$ may be measured in feet; another component of $X$ may be in pounds and the corresponding one of $Y$ in ounces. The covariance matrix of $Y$ is $\mathcal{E}YY' = \mathcal{E}\Delta XX'\Delta = \Delta\Sigma\Delta = \Psi$, say. Then analysis of $Y$ into principal components involves maximizing $\mathcal{E}(\gamma'Y)^2 = \gamma'\Psi\gamma$ relative to $\gamma'\gamma$ and leads to the equation $0 = (\Psi - \nu I)\gamma = (\Delta\Sigma\Delta - \nu I)\gamma$, where $\nu$ must satisfy $|\Psi - \nu I| = 0$. Multiplication on the left by $\Delta^{-1}$ gives

(31)  $0 = (\Sigma - \nu\Delta^{-2})(\Delta\gamma).$

Let $\Delta\gamma = \alpha$; that is, $\gamma'Y = \gamma'\Delta X = \alpha'X$. Then (31) results from maximizing $\mathcal{E}(\alpha'X)^2 = \alpha'\Sigma\alpha$ relative to $\alpha'\Delta^{-2}\alpha$. This last quadratic form is a weighted sum of squares, the weights being the diagonal elements of $\Delta^{-2}$.

It might be noted that if $\Delta^{-2}$ is taken to be the matrix

(32)  $\Delta^{-2} = \begin{pmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{pmatrix},$

then $\Psi$ is the matrix of correlations.
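The dependence on units can be seen in a small computation. The following sketch assumes NumPy; the covariance matrix `sigma` is an illustrative choice, and the variable names mirror the symbols of (31) and (32) only loosely.

```python
import numpy as np

sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])               # illustrative covariance matrix

# Choose Delta so that Delta^{-2} = diag(sigma_11, ..., sigma_pp), as in (32).
delta = np.diag(1.0 / np.sqrt(np.diag(sigma)))
psi = delta @ sigma @ delta                       # covariance matrix of Y = Delta X

nu, gamma = np.linalg.eigh(psi)                   # roots and vectors for Y
alpha = delta @ gamma                             # alpha = Delta gamma, column by column
delta_inv_sq = np.diag(np.diag(sigma))

# Equation (31), written for all roots at once: Sigma alpha = Delta^{-2} alpha diag(nu).
residual = sigma @ alpha - delta_inv_sq @ alpha @ np.diag(nu)

roots_cov = np.linalg.eigvalsh(sigma)             # roots for the original units
print(np.diag(psi))
print(nu, roots_cov)
```

Here `psi` has unit diagonal (it is the correlation matrix), and its roots differ from those of `sigma`, which is the sense in which the analysis depends on the units.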


11.3. MAXIMUM LIKELIHOOD ESTIMATORS OF THE PRINCIPAL 
COMPONENTS AND THEIR VARIANCES 


A primary problem of statistical inference in principal component analysis is to estimate the vectors $\beta^{(1)}, \ldots, \beta^{(p)}$ and the scalars $\lambda_1, \ldots, \lambda_p$. We apply the algebra of the preceding section to an estimate of the covariance matrix.

Theorem 11.3.1. Let $x_1, \ldots, x_N$ be $N$ $(> p)$ observations from $N(\mu, \Sigma)$, where $\Sigma$ is a matrix with $p$ different characteristic roots. Then a set of maximum likelihood estimators of $\lambda_1, \ldots, \lambda_p$ and $\beta^{(1)}, \ldots, \beta^{(p)}$ defined in Theorem 11.2.1 consists of the roots $k_1 > \cdots > k_p$ of

(1)  $|\hat{\Sigma} - kI| = 0$

and a set of corresponding vectors $b^{(1)}, \ldots, b^{(p)}$ satisfying

(2)  $(\hat{\Sigma} - k_i I)b^{(i)} = 0,$

(3)  $b^{(i)'}b^{(i)} = 1,$

where $\hat{\Sigma}$ is the maximum likelihood estimate of $\Sigma$.




Proof. When the roots of $|\Sigma - \lambda I| = 0$ are different, each vector $\beta^{(i)}$ is uniquely defined except that $\beta^{(i)}$ can be replaced by $-\beta^{(i)}$. If we require that the first nonzero component of $\beta^{(i)}$ be positive, then $\beta^{(i)}$ is uniquely defined, and $\mu, \Lambda, B$ is a single-valued function of $\mu, \Sigma$. By Corollary 3.2.1, the set of maximum likelihood estimates of $\mu, \Lambda, B$ is the same function of $\hat{\mu}, \hat{\Sigma}$. This function is defined by (1), (2), and (3) with the corresponding restriction that the first nonzero component of $b^{(i)}$ must be positive. [It can be shown that if $|\Sigma| \ne 0$, the probability is 1 that the roots of (1) are different, because the conditions on $\hat{\Sigma}$ for the roots to have multiplicities higher than 1 determine a region in the space of $\hat{\Sigma}$ of dimensionality less than $\tfrac{1}{2}p(p + 1)$; see Okamoto (1973).] From (18) of Section 11.2 we see that

(4)  $\Sigma = B\Lambda B' = \sum_{i=1}^{p} \lambda_i\beta^{(i)}\beta^{(i)'},$

and by the same algebra

(5)  $\hat{\Sigma} = \sum_{i=1}^{p} k_i b^{(i)}b^{(i)'}.$

Replacing $b^{(i)}$ by $-b^{(i)}$ clearly does not change $\sum k_i b^{(i)}b^{(i)'}$. Since the likelihood function depends only on $\hat{\Sigma}$ (see Section 3.2), the maximum of the likelihood function is attained by taking any set of solutions of (2) and (3). ∎
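The estimators of Theorem 11.3.1, including the sign convention used in the proof, can be sketched as follows. This is an illustration under stated assumptions, not the computational method of the text: it assumes NumPy, uses `numpy.linalg.eigh` to solve (1) to (3), and the function name `ml_principal_components`, the tolerance `1e-12`, and the simulated data are all hypothetical.

```python
import numpy as np

def ml_principal_components(x):
    """x holds one observation per row; returns (sigma_hat, k, b), k descending."""
    n_obs = x.shape[0]
    centered = x - x.mean(axis=0)
    sigma_hat = centered.T @ centered / n_obs     # ML estimate divides by N
    k, b = np.linalg.eigh(sigma_hat)
    order = np.argsort(k)[::-1]
    k, b = k[order], b[:, order]
    # Sign convention: make the first nonzero component of each vector positive.
    for i in range(b.shape[1]):
        nonzero = np.flatnonzero(np.abs(b[:, i]) > 1e-12)
        if b[nonzero[0], i] < 0:
            b[:, i] = -b[:, i]
    return sigma_hat, k, b

# Simulated data for illustration only (50 observations, 3 components).
rng = np.random.default_rng(2)
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.5]])
x = rng.normal(size=(50, 3)) @ mixing
sigma_hat, k, b = ml_principal_components(x)

# Equation (5): sigma_hat equals the sum of k_i b^(i) b^(i)'.
rebuilt = sum(k[i] * np.outer(b[:, i], b[:, i]) for i in range(3))
print(k)
```

Flipping the sign of a column of `b` leaves `rebuilt` unchanged, which is why the convention is needed only to make the estimate unique.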


It is possible to assume explicitly arbitrary multiplicities of roots of $\Sigma$. If these multiplicities are not all unity, the maximum likelihood estimates are not defined as in Theorem 11.3.1. [See Anderson (1963a).] As an example suppose that we assume that the equation $|\Sigma - \lambda I| = 0$ has one root of multiplicity $p$. Let this root be $\lambda_1$. Then by Corollary 11.2.1, $\Sigma - \lambda_1 I$ is of rank 0; that is, $\Sigma - \lambda_1 I = 0$ or $\Sigma = \lambda_1 I$. If $X$ is distributed according to $N(\mu, \Sigma) = N(\mu, \lambda_1 I)$, the components of $X$ are independently distributed with variance $\lambda_1$. Thus the maximum likelihood estimator of $\lambda_1$ is

(6)  $\hat{\lambda}_1 = \frac{1}{pN} \sum_{i=1}^{p} \sum_{\alpha=1}^{N} (x_{i\alpha} - \bar{x}_i)^2,$

and $\hat{\Sigma} = \hat{\lambda}_1 I$, and $B$ can be any orthogonal matrix. It might be pointed out that in Section 10.7 we considered a test of the hypothesis that $\Sigma = \lambda_1 I$ (with $\lambda_1$ unspecified), that is, the hypothesis is that $\Sigma$ has one characteristic root of multiplicity $p$.

In most applications of principal component analysis it can be assumed that the roots of $\Sigma$ are different. It might also be pointed out that in some uses of this method the algebra is applied to the matrix of correlation coefficients rather than to the covariance matrix. In general this leads to different roots and vectors.


11.4. COMPUTATION OF THE MAXIMUM LIKELIHOOD 
ESTIMATES OF THE PRINCIPAL COMPONENTS 


There are several ways of computing the characteristic roots and characteristic vectors (principal components) of a matrix $\Sigma$ or $\hat{\Sigma}$. We shall indicate some of them.

One method for small $p$ involves expanding the determinantal equation

(1)  $0 = |\Sigma - \lambda I|$

and solving the resulting $p$th-degree equation in $\lambda$ (e.g., by Newton's method or the secant method) for the roots $\lambda_1 > \lambda_2 > \cdots > \lambda_p$. Then $\Sigma - \lambda_i I$ is of rank $p - 1$, and a solution of $(\Sigma - \lambda_i I)\beta^{(i)} = 0$ can be obtained by taking $\beta_j^{(i)}$ as the cofactor of the element in the first (or any other fixed) column and $j$th row of $\Sigma - \lambda_i I$.
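For a small matrix this first method can be sketched directly. The example assumes NumPy; `numpy.poly` and `numpy.roots` stand in for the hand expansion and the Newton or secant iteration mentioned above, and the helper `cofactor_vector` and the matrix `sigma` are hypothetical illustrations.

```python
import numpy as np

def cofactor_vector(m):
    """Cofactors of the first-column elements of m, one entry per row j."""
    p = m.shape[0]
    out = np.empty(p)
    for j in range(p):
        minor = np.delete(np.delete(m, j, axis=0), 0, axis=1)
        out[j] = (-1.0) ** j * np.linalg.det(minor)   # sign (-1)^(j+1) with 1-based j
    return out

sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])                   # illustrative covariance matrix

coeffs = np.poly(sigma)                # coefficients of |lambda I - Sigma|
roots = np.sort(np.real(np.roots(coeffs)))[::-1]

columns = []
for lam in roots:
    beta = cofactor_vector(sigma - lam * np.eye(3))
    columns.append(beta / np.linalg.norm(beta))       # normalize so beta'beta = 1
vectors = np.column_stack(columns)
print(coeffs)
print(roots)
```

The cofactor construction relies on each root being simple, as the text assumes; for a multiple root the cofactors all vanish.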

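The second method iterates using the equation for a characteristic root and the corresponding characteristic vector

(2)  $\Sigma x = \lambda x,$

where we have written the equation for the population. Let $x_{(0)}$ be any vector not orthogonal to the first characteristic vector, and define

(3)  $x_{(i)} = \Sigma y_{(i-1)}, \qquad y_{(i)} = \frac{1}{\sqrt{x_{(i)}'x_{(i)}}}\, x_{(i)}, \qquad i = 0, 1, 2, \ldots.$

It can be shown (Problem 11.12) that

(4)  $\lim_{i \to \infty} y_{(i)} = \pm\beta^{(1)}, \qquad \lim_{i \to \infty} x_{(i)}'x_{(i)} = \lambda_1^2.$

The rate of convergence depends on the ratio $\lambda_2 / \lambda_1$; the closer this ratio is to 1, the slower the convergence.

To find the second root and vector define

(5)  $\Sigma_2 = \Sigma - \lambda_1\beta^{(1)}\beta^{(1)'}.$

Then

(6)  $\Sigma_2\beta^{(i)} = \Sigma\beta^{(i)} - \lambda_1\beta^{(1)}\beta^{(1)'}\beta^{(i)} = \Sigma\beta^{(i)} = \lambda_i\beta^{(i)}$

if $i \ne 1$, and

(7)  $\Sigma_2\beta^{(1)} = 0.$

Thus $\lambda_2$ is the largest root of $\Sigma_2$, and $\beta^{(2)}$ is the corresponding vector. The iteration process is now applied to $\Sigma_2$ to find $\lambda_2$ and $\beta^{(2)}$. Defining $\Sigma_3 = \Sigma_2 - \lambda_2\beta^{(2)}\beta^{(2)'}$, we can find $\lambda_3$ and $\beta^{(3)}$, and so forth.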
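The iteration (3) and the deflation step (5) can be sketched in a few lines. This is a minimal illustration assuming NumPy, positive distinct roots, and a starting vector of ones that is not orthogonal to any characteristic vector; the function names, the fixed iteration count `n_iter`, and the matrix `sigma` are hypothetical choices.

```python
import numpy as np

def largest_root(sigma, x0, n_iter=500):
    """Iterate x_(i) = Sigma y_(i-1), y_(i) = x_(i) / sqrt(x_(i)' x_(i))."""
    y = x0 / np.linalg.norm(x0)
    for _ in range(n_iter):
        x = sigma @ y
        y = x / np.linalg.norm(x)
    x = sigma @ y
    return np.sqrt(x @ x), y            # sqrt(x'x) tends to the largest root

def roots_by_deflation(sigma, n_iter=500):
    """Find all roots in turn, subtracting lambda * beta beta' after each one."""
    p = sigma.shape[0]
    work = np.array(sigma, dtype=float)
    roots, columns = [], []
    for _ in range(p):
        lam, beta = largest_root(work, np.ones(p), n_iter)
        roots.append(lam)
        columns.append(beta)
        work = work - lam * np.outer(beta, beta)    # deflation as in (5)
    return np.array(roots), np.column_stack(columns)

sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])                 # illustrative covariance matrix
roots, vectors = roots_by_deflation(sigma)
print(roots)
```

A fixed number of iterations keeps the sketch short; a practical version would stop when successive `y` vectors agree to a chosen tolerance, and convergence slows as the ratio of adjacent roots approaches 1.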

There are several ways in which the labor of the iteration procedure may be reduced. One is to raise $\Sigma$ to a power before proceeding with the iteration. Thus one can use $\Sigma^2$, defining

(8)  $x_{(i)} = \Sigma^2 y_{(i-1)}, \qquad y_{(i)} = \frac{1}{\sqrt{x_{(i)}'x_{(i)}}}\, x_{(i)}, \qquad i = 0, 1, 2, \ldots.$

This procedure will give twice as rapid convergence as the use of (3). Using $\Sigma^4 = \Sigma^2\Sigma^2$ will lead to convergence four times as rapid, and so on. It should be noted that since $\Sigma^2$ is symmetric, there are only $p(p + 1)/2$ elements to be found.

Efficient computation, however, uses other methods. One method is the QR or QL algorithm. Let $\Sigma_1 = \Sigma$. Define recursively the orthogonal $Q_i$ and lower triangular $L_i$ by $\Sigma_i = Q_i L_i$ and $\Sigma_{i+1} = L_i Q_i$ $(= Q_i'\Sigma_i Q_i)$, $i = 1, 2, \ldots$. (The Gram-Schmidt orthogonalization is a way of finding $Q_i$ and $L_i$; the QR method replaces a lower triangular matrix $L$ by an upper triangular matrix $R$.) If the characteristic roots of $\Sigma$ are distinct, $\lim_{i \to \infty} \Sigma_{i+1} = \Lambda^*$, where $\Lambda^*$ is the diagonal matrix with the roots usually ordered in ascending order. The characteristic vectors are the columns of $\lim_{i \to \infty} Q_i'Q_{i-1}' \cdots Q_1'$ (which is computed recursively).
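A bare-bones version of this recursion is sketched below using the QR variant (upper triangular factor), since that is what `numpy.linalg.qr` provides; it is an unshifted iteration for illustration, not the efficient algorithm. The function name `qr_iteration`, the iteration count, and the matrix `sigma` are assumptions. In this sketch the orthogonal factors are accumulated as the product $Q_1 Q_2 \cdots Q_i$, whose columns approximate the characteristic vectors.

```python
import numpy as np

def qr_iteration(sigma, n_iter=300):
    """Unshifted QR iteration: factor work = q r, then replace work by r q."""
    work = np.array(sigma, dtype=float)
    vectors = np.eye(work.shape[0])
    for _ in range(n_iter):
        q, r = np.linalg.qr(work)       # q orthogonal, r upper triangular
        work = r @ q                    # equals q' work q, so the roots are unchanged
        vectors = vectors @ q           # accumulate Q_1 Q_2 ... Q_i
    return np.diag(work).copy(), vectors, work

sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])     # illustrative covariance matrix
roots, vectors, limit = qr_iteration(sigma)
off_diagonal = limit - np.diag(np.diag(limit))
print(roots)
print(np.abs(off_diagonal).max())
```

The number of iterations needed depends on the ratios of adjacent roots; practical implementations add shifts and work on the tridiagonal form.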

A more efficient algorithm (for the symmetric $\Sigma$) uses a sequence of Householder transformations to carry $\Sigma$ to tridiagonal form. A Householder matrix is $H = I - 2\alpha\alpha'$ where $\alpha'\alpha = 1$. Such a matrix is orthogonal and symmetric. A Householder transformation of the symmetric matrix $\Sigma$ is $H\Sigma H$. It is symmetric and has the same characteristic roots as $\Sigma$; its characteristic vectors are $H$ times those of $\Sigma$.

A tridiagonal matrix is one with all entries 0 except on the main diagonal, the first superdiagonal, and the first subdiagonal. A sequence of $p - 2$ Householder transformations carries the symmetric $\Sigma$ to tridiagonal form. (The first one inserts 0's into the last $p - 2$ entries of the first column and row of $H\Sigma H$, etc. See Problem 11.13.)
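The reduction to tridiagonal form can be sketched as follows, assuming NumPy. The function name `householder_tridiagonalize`, the sign choice for the reflection vector, and the random symmetric test matrix are illustrative assumptions rather than details given in the text.

```python
import numpy as np

def householder_tridiagonalize(sigma):
    """Apply p - 2 Householder transformations H T H to reach tridiagonal form."""
    t = np.array(sigma, dtype=float)
    p = t.shape[0]
    for k in range(p - 2):
        x = t[k + 1:, k].copy()                 # entries below the diagonal in column k
        norm_x = np.linalg.norm(x)
        if norm_x == 0.0:
            continue                            # column already has the required zeros
        v = x.copy()
        v[0] += np.copysign(norm_x, x[0])       # sign chosen to avoid cancellation
        v /= np.linalg.norm(v)
        h = np.eye(p)
        h[k + 1:, k + 1:] -= 2.0 * np.outer(v, v)   # H = I - 2 alpha alpha'
        t = h @ t @ h                           # H is symmetric and orthogonal
    return t

rng = np.random.default_rng(3)
m = rng.normal(size=(5, 5))
sym = m @ m.T                                   # illustrative symmetric matrix
tri = householder_tridiagonalize(sym)
print(np.round(tri, 6))
```

Forming each `h` explicitly costs more arithmetic than necessary; it is done here only to keep the correspondence with $H\Sigma H$ visible.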




The QL method is applied to the tridiagonal form. At the $i$th step let the
tridiagonal matrix be $T_0^{(i)}$; let $P_j^{(i)}$ be a block-diagonal matrix (Givens matrix)

$$P_j^{(i)} = \begin{pmatrix} I & 0 & 0 & 0 \\ 0 & \cos\theta_j & -\sin\theta_j & 0 \\ 0 & \sin\theta_j & \cos\theta_j & 0 \\ 0 & 0 & 0 & I \end{pmatrix}, \tag{9}$$

where $\cos\theta_j$ is the $j$th and $(j+1)$st diagonal element; and let $T_j^{(i)} = P_{p-j}^{(i)} T_{j-1}^{(i)}$,
$j = 1, \ldots, p-1$. Here $\theta_j$ is chosen so that the element in position $j, j+1$ in $T_j$
is 0. Then $P^{(i)} = P_1^{(i)} P_2^{(i)} \cdots P_{p-1}^{(i)}$ is orthogonal and $P^{(i)} T_0^{(i)} = R^{(i)}$ is lower
triangular. Then $T_0^{(i+1)} = R^{(i)} P^{(i)\prime}$ $(= P^{(i)} T_0^{(i)} P^{(i)\prime})$ is symmetric and tridiagonal.
It converges to $\Lambda^*$ (if the roots are all different). For more details see
Chapters II/2 and II/3 of Wilkinson and Reinsch (1971), Chapter 5 of
Wilkinson (1965), and Chapters 5, 7, and 8 of Golub and Van Loan (1989).
A sequence of one-sided Householder transformations ($H\Sigma$) can carry $\Sigma$ to
$R$ (upper triangular), thus effecting the QR decomposition.


11.5. AN EXAMPLE 


In Table 3.4 we presented three samples of observations on varieties of iris
[Fisher (1936)]; as an example of principal component analysis we use one of
those samples, namely Iris versicolor. There are 50 observations ($N = 50$,
$n = N - 1 = 49$). Each observation consists of four measurements on a plant:
$x_1$ is sepal length, $x_2$ is sepal width, $x_3$ is petal length, and $x_4$ is petal width.
The observed sums of squares and cross products of deviations from means
are

$$A = \sum_{\alpha=1}^{50} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' = \begin{pmatrix} 13.0552 & 4.1740 & 8.9620 & 2.7332 \\ 4.1740 & 4.8250 & 4.0500 & 2.0190 \\ 8.9620 & 4.0500 & 10.8200 & 3.5820 \\ 2.7332 & 2.0190 & 3.5820 & 1.9162 \end{pmatrix}, \tag{1}$$

and an estimate of $\Sigma$ is

$$S = \frac{1}{49} A = \begin{pmatrix} 0.266433 & 0.085184 & 0.182899 & 0.055780 \\ 0.085184 & 0.098469 & 0.082653 & 0.041204 \\ 0.182899 & 0.082653 & 0.220816 & 0.073102 \\ 0.055780 & 0.041204 & 0.073102 & 0.039106 \end{pmatrix}. \tag{2}$$




We use the iterative procedure to find the first principal component, by
computing in turn $z^{(j)} = S z^{(j-1)}$. As an initial approximation, we use $z^{(0)\prime} =
(1, 0, 1, 0)$. It is not necessary to normalize the vector at each iteration; but to
compare successive vectors, we compute $z_i^{(j)}/z_i^{(j-1)} = r_i^{(j)}$, each of which is an
approximation to $l_1$, the largest root of $S$. After seven iterations, the $r_i^{(7)}$ agree to
within two units in the fifth decimal place (fifth significant figure). This
vector is normalized, and $S$ is applied to the normalized vector. The ratios,
$r_i^{(8)}$, agree to within two units in the sixth place; the value of $l_1$ is (nearly
accurate to the sixth place) $l_1 = 0.487875$. The normalized eighth iterated
vector is our estimate of $\beta^{(1)}$, namely,

$$b^{(1)} = \begin{pmatrix} 0.6867244 \\ 0.3053463 \\ 0.6236628 \\ 0.2149837 \end{pmatrix}. \tag{3}$$

This vector agrees with the normalized seventh iterate to about one unit in
the sixth place. It should be pointed out that $l_1$ and $b^{(1)}$ have to be calculated
more accurately than $l_2$ and $b^{(2)}$, and so forth. The trace of $S$ is 0.624824,
which is the sum of the roots. Thus $l_1$ is more than three times the sum of
the other roots.

We next compute

$$S_2 = S - l_1 b^{(1)} b^{(1)\prime} = \begin{pmatrix} 0.0363559 & -0.0171179 & -0.0260502 & -0.0162472 \\ -0.0171179 & 0.0529813 & -0.0102546 & 0.0091777 \\ -0.0260502 & -0.0102546 & 0.0310544 & 0.0076890 \\ -0.0162472 & 0.0091777 & 0.0076890 & 0.0165574 \end{pmatrix}, \tag{4}$$

and iterate $z^{(j)} = S_2 z^{(j-1)}$, using $z^{(0)\prime} = (0, 1, 0, 0)$. (In the actual computation
$S_2$ was multiplied by 10 and the first row and column were multiplied by $-1$.)
In this case the iteration does not proceed as rapidly; as will be seen, the
ratio of $l_2$ to $l_3$ is approximately 1.32. On the last iteration, the ratios agree
to within four units in the fifth significant figure. We obtain $l_2 = 0.0723828$
and

$$b^{(2)} = \begin{pmatrix} -0.669033 \\ 0.567484 \\ 0.343309 \\ 0.335307 \end{pmatrix}. \tag{5}$$

The third principal component is found from $S_3 = S_2 - l_2 b^{(2)} b^{(2)\prime}$, and the
fourth from $S_4 = S_3 - l_3 b^{(3)} b^{(3)\prime}$.


The results may be summarized as follows: 


$$(l_1, l_2, l_3, l_4) = (0.4879,\ 0.0724,\ 0.0548,\ 0.0098), \tag{6}$$

$$B = \begin{pmatrix} 0.6867 & -0.6690 & -0.2651 & 0.1023 \\ 0.3053 & 0.5675 & -0.7296 & -0.2289 \\ 0.6237 & 0.3433 & 0.6272 & -0.3160 \\ 0.2150 & 0.3353 & 0.0637 & 0.9150 \end{pmatrix}. \tag{7}$$

The sum of the four roots is $\sum_{i=1}^{4} l_i = 0.6249$, compared with the trace of the
sample covariance matrix, $\operatorname{tr} S = 0.624824$. The first accounts for 78% of the
total variance in the four measurements; the last accounts for a little more
than 1%. In fact, the variance of $0.7x_1 + 0.3x_2 + 0.6x_3 + 0.2x_4$ (an approximation
to the first principal component) is 0.478, which is almost 77% of the
total variance. If one is interested in studying the variations in conditions that
lead to variations of $(x_1, x_2, x_3, x_4)$, one can look for variations in conditions
that lead to variations of $0.7x_1 + 0.3x_2 + 0.6x_3 + 0.2x_4$. It is not very important
if the other variations in $(x_1, x_2, x_3, x_4)$ are neglected in exploratory
investigations.
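
The computation of this section can be retraced with a short script. The sketch below is not the hand computation described above: it normalizes at every step, uses the Rayleigh quotient rather than componentwise ratios to estimate each root, runs a fixed (generous) number of steps, and starts every iteration from the hypothetical vector $(1, 1, 1, 1)'$. The helper names are illustrative only.

```python
# Sample covariance matrix S of Iris versicolor, copied from (2).
S = [
    [0.266433, 0.085184, 0.182899, 0.055780],
    [0.085184, 0.098469, 0.082653, 0.041204],
    [0.182899, 0.082653, 0.220816, 0.073102],
    [0.055780, 0.041204, 0.073102, 0.039106],
]

def apply(m, z):
    """Matrix-vector product m z."""
    return [sum(row[j] * z[j] for j in range(len(z))) for row in m]

def largest_root(m, start, steps=2000):
    """Power iteration z <- m z with normalization at each step.
    Returns (estimated largest root, unit vector)."""
    z = start[:]
    for _ in range(steps):
        z = apply(m, z)
        norm = sum(x * x for x in z) ** 0.5
        z = [x / norm for x in z]
    root = sum(zi * mzi for zi, mzi in zip(z, apply(m, z)))  # Rayleigh quotient
    return root, z

def deflate(m, root, b):
    """Return m - root * b b', as in (4)."""
    p = len(b)
    return [[m[i][j] - root * b[i] * b[j] for j in range(p)] for i in range(p)]

roots, vectors = [], []
current = S
for _ in range(4):
    root, b = largest_root(current, [1.0, 1.0, 1.0, 1.0])
    roots.append(root)
    vectors.append(b)
    current = deflate(current, root, b)

share_first = roots[0] / sum(roots)  # proportion of total variance, first component
```

The entries of `roots` can be compared with (6) and `share_first` with the 78% quoted above; the signs of the computed vectors are arbitrary and need not match (7).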


11.6. STATISTICAL INFERENCE 


11.6.1. Asymptotic Distributions 


In Section 13.3 we shall derive the exact distribution of the sample characteristic
roots and vectors when the population covariance matrix is $I$ or
proportional to $I$; that is, in the case of all population roots equal. The exact
distribution of roots and vectors when the population roots are not all equal
involves a multiply infinite series of zonal polynomials; that development is
beyond the scope of this book. [See Muirhead (1982).] We derive the
asymptotic distribution of the roots and vectors when the population roots
are all different (Theorem 13.5.1) and also when one root is multiple
(Theorem 13.5.2). Since it can usually be assumed that the population roots
are different unless there is information to the contrary, we summarize here
Theorem 13.5.1.

As earlier, let the characteristic roots of $\Sigma$ be $\lambda_1 > \cdots > \lambda_p$ and the
corresponding characteristic vectors be $\beta^{(1)}, \ldots, \beta^{(p)}$, normalized so $\beta^{(i)\prime}\beta^{(i)} = 1$
and satisfying $\beta_{1i} \ge 0$, $i = 1, \ldots, p$. Let the roots and vectors of $S$ be
$l_1 > \cdots > l_p$ and $b^{(1)}, \ldots, b^{(p)}$, normalized so $b^{(i)\prime}b^{(i)} = 1$ and satisfying $b_{1i} \ge 0$,
$i = 1, \ldots, p$. Let $d_i = \sqrt{n}\,(l_i - \lambda_i)$ and $g^{(i)} = \sqrt{n}\,(b^{(i)} - \beta^{(i)})$, $i = 1, \ldots, p$. Then
in the limiting normal distribution the sets $d_1, \ldots, d_p$ and $g^{(1)}, \ldots, g^{(p)}$ are
independent and $d_1, \ldots, d_p$ are mutually independent. The element $d_i$ has
the limiting distribution $N(0, 2\lambda_i^2)$. The covariances of $g^{(1)}, \ldots, g^{(p)}$ in the
limiting distribution are

$$\mathcal{AC}(g^{(i)}) = \sum_{\substack{k=1 \\ k \ne i}}^{p} \frac{\lambda_i\lambda_k}{(\lambda_i - \lambda_k)^2}\, \beta^{(k)}\beta^{(k)\prime}, \tag{1}$$

$$\mathcal{AC}(g^{(i)}, g^{(j)}) = -\frac{\lambda_i\lambda_j}{(\lambda_i - \lambda_j)^2}\, \beta^{(j)}\beta^{(i)\prime}, \qquad i \ne j. \tag{2}$$

See Theorem 13.5.1.

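In making inferences about a single ordered root, one treats $l_i$ as approximately
normal with mean $\lambda_i$ and variance $2\lambda_i^2/n$. Since $l_i$ is a consistent
estimate of $\lambda_i$, the limiting distribution of

$$\sqrt{n}\,\frac{l_i - \lambda_i}{\sqrt{2}\, l_i} \tag{3}$$

is $N(0, 1)$. A two-tailed test of the hypothesis $\lambda_i = \lambda_i^0$ has the (asymptotic)
acceptance region

$$-z(\varepsilon) \le \sqrt{\frac{n}{2}}\, \frac{l_i - \lambda_i^0}{\lambda_i^0} \le z(\varepsilon), \tag{4}$$

where the value of the $N(0, 1)$ distribution beyond $z(\varepsilon)$ is $\tfrac{1}{2}\varepsilon$. The interval
(4) can be inverted to give a confidence interval for $\lambda_i$ with confidence $1 - \varepsilon$:

$$\frac{l_i}{1 + \sqrt{2/n}\, z(\varepsilon)} \le \lambda_i \le \frac{l_i}{1 - \sqrt{2/n}\, z(\varepsilon)}. \tag{5}$$

Note that the confidence coefficient should be taken large enough so
$\sqrt{2/n}\, z(\varepsilon) < 1$. Alternatively, one can use the fact that the limiting distribution
of $\sqrt{n}\,(\log l_i - \log \lambda_i)$ is $N(0, 2)$ by Theorem 4.2.3.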
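
As an illustration, the interval (5) and the interval based on $\log l_i$ can be evaluated for the largest root of the Iris versicolor example of Section 11.5. The function names are hypothetical, and the choice $\varepsilon = 0.05$ is an assumption made only for the example.

```python
from math import exp, sqrt
from statistics import NormalDist

def root_interval(l_i, n, eps):
    """Large-sample interval (5) for a single root lambda_i."""
    z = NormalDist().inv_cdf(1.0 - eps / 2.0)   # probability eps/2 lies beyond z
    half = sqrt(2.0 / n) * z
    if half >= 1.0:
        raise ValueError("sqrt(2/n) z(eps) must be less than 1 for (5)")
    return l_i / (1.0 + half), l_i / (1.0 - half)

def root_interval_log(l_i, n, eps):
    """Interval from sqrt(n)(log l_i - log lambda_i) being approximately N(0, 2)."""
    z = NormalDist().inv_cdf(1.0 - eps / 2.0)
    half = sqrt(2.0 / n) * z
    return l_i * exp(-half), l_i * exp(half)

# l_1 = 0.4879 and n = 49 are taken from Section 11.5; eps is chosen for illustration.
lower, upper = root_interval(0.4879, 49, 0.05)
lower_log, upper_log = root_interval_log(0.4879, 49, 0.05)
```

Both intervals contain $l_1$; the logarithmic form needs no restriction on $\sqrt{2/n}\,z(\varepsilon)$.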

Inference about components of a vector $\beta^{(i)}$ can be based on treating $b^{(i)}$
as being approximately normal with mean $\beta^{(i)}$ and (singular) covariance
matrix $1/n$ times (1).


11.6.2. Confidence Region for a Characteristic Vector 


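We use the asymptotic distribution of the sample characteristic vectors to
obtain a large-sample confidence region for the $i$th characteristic vector of $\Sigma$
[Anderson (1963a)]. The covariance matrix (1) can be written

$$B\Lambda_i B' = B_i^* \Lambda_i^* B_i^{*\prime}, \tag{6}$$

where $\Lambda_i$ is the $p \times p$ diagonal matrix with 0 as the $i$th diagonal element and
$\lambda_i\lambda_j/(\lambda_i - \lambda_j)^2$ as the $j$th diagonal element, $j \ne i$; $\Lambda_i^*$ is the $(p-1) \times (p-1)$
diagonal matrix obtained from $\Lambda_i$ by deleting the $i$th row and column; and
$B_i^*$ is the $p \times (p-1)$ matrix formed by deleting the $i$th column from $B$. Then
$h^{(i)} = \Lambda_i^{*-1/2} B_i^{*\prime} \sqrt{n}\,(b^{(i)} - \beta^{(i)})$ has a limiting normal distribution with mean 0
and covariance matrix

$$\mathcal{E}\, h^{(i)} h^{(i)\prime} = \Lambda_i^{*-1/2} B_i^{*\prime} \bigl(B_i^* \Lambda_i^* B_i^{*\prime}\bigr) B_i^* \Lambda_i^{*-1/2} = I_{p-1}, \tag{7}$$

and

$$h^{(i)\prime} h^{(i)} = n\,(b^{(i)} - \beta^{(i)})' B_i^* \Lambda_i^{*-1} B_i^{*\prime} (b^{(i)} - \beta^{(i)}) \tag{8}$$

has a limiting $\chi^2$-distribution with $p - 1$ degrees of freedom. The matrix of
the quadratic form in $\sqrt{n}\,(b^{(i)} - \beta^{(i)})$ is

$$\begin{aligned} B_i^* \Lambda_i^{*-1} B_i^{*\prime} &= \sum_{j \ne i} \beta^{(j)} \left[\frac{\lambda_i}{\lambda_j} - 2 + \frac{\lambda_j}{\lambda_i}\right] \beta^{(j)\prime} = \sum_{j=1}^{p} \beta^{(j)} \left[\frac{\lambda_i}{\lambda_j} - 2 + \frac{\lambda_j}{\lambda_i}\right] \beta^{(j)\prime} \\ &= \lambda_i \Sigma^{-1} - 2I + (1/\lambda_i)\Sigma, \end{aligned} \tag{9}$$

because $B\Lambda^{-1}B' = \Sigma^{-1}$, $BB' = I$, and $B\Lambda B' = \Sigma$. Then (8) is

$$\begin{aligned} n\,(b^{(i)} - \beta^{(i)})' \bigl[\lambda_i \Sigma^{-1} - 2I + (1/\lambda_i)\Sigma\bigr] (b^{(i)} - \beta^{(i)}) &= n\, b^{(i)\prime} \bigl[\lambda_i \Sigma^{-1} - 2I + (1/\lambda_i)\Sigma\bigr] b^{(i)} \\ &= n \bigl[\lambda_i\, b^{(i)\prime} \Sigma^{-1} b^{(i)} + (1/\lambda_i)\, b^{(i)\prime} \Sigma\, b^{(i)} - 2\bigr], \end{aligned} \tag{10}$$

because $\beta^{(i)}$ is a characteristic vector of $\Sigma$ with root $\lambda_i$, and of $\Sigma^{-1}$ with
root $1/\lambda_i$. On the left-hand side of (10) we can replace $\Sigma$ and $\lambda_i$ by the
consistent estimators $S$ and $l_i$ to obtain

$$\begin{aligned} n\,(b^{(i)} - \beta^{(i)})' \bigl[l_i S^{-1} - 2I + (1/l_i) S\bigr] (b^{(i)} - \beta^{(i)}) \\ = n \bigl[l_i\, \beta^{(i)\prime} S^{-1} \beta^{(i)} + (1/l_i)\, \beta^{(i)\prime} S \beta^{(i)} - 2\bigr], \end{aligned} \tag{11}$$

which has a limiting $\chi^2$-distribution with $p - 1$ degrees of freedom.

A confidence region for the $i$th characteristic vector of $\Sigma$ with confidence
$1 - \varepsilon$ consists of the intersection of $\beta^{(i)\prime}\beta^{(i)} = 1$ and the set of $\beta^{(i)}$ such that
the right-hand side of (11) is less than $\chi^2_{p-1}(\varepsilon)$, where $\Pr\{\chi^2_{p-1} > \chi^2_{p-1}(\varepsilon)\} = \varepsilon$.
Note that the matrix of the quadratic form (9) is positive semidefinite.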
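
The right-hand side of (11) can be evaluated without inverting $S$ by using the sample analog of (9): for a unit vector $\beta$ it equals $n\sum_j (l_i/l_j - 2 + l_j/l_i)(b^{(j)\prime}\beta)^2$. The sketch below applies this to the rounded roots and vectors of Section 11.5. The function name, the hypothesized vector (the rounded coefficients $0.7, 0.3, 0.6, 0.2$ rescaled to unit length), and the level $\varepsilon = 0.05$ are assumptions made for illustration.

```python
from math import sqrt

ROOTS = [0.4879, 0.0724, 0.0548, 0.0098]          # (6) of Section 11.5
B = [[0.6867, -0.6690, -0.2651, 0.1023],          # (7) of Section 11.5;
     [0.3053, 0.5675, -0.7296, -0.2289],          # columns are b(1), ..., b(4)
     [0.6237, 0.3433, 0.6272, -0.3160],
     [0.2150, 0.3353, 0.0637, 0.9150]]

def vector_statistic(i, beta, roots, b, n):
    """Right-hand side of (11) for a unit vector beta, written in terms of
    the roots and vectors of S (index i counts from 0)."""
    p = len(roots)
    total = 0.0
    for j in range(p):
        proj = sum(b[k][j] * beta[k] for k in range(p))   # b(j)' beta
        total += (roots[i] / roots[j] - 2.0 + roots[j] / roots[i]) * proj ** 2
    return n * total

raw = [0.7, 0.3, 0.6, 0.2]
length = sqrt(sum(x * x for x in raw))
beta0 = [x / length for x in raw]                  # hypothesized first vector

stat = vector_statistic(0, beta0, ROOTS, B, 49)
CHI2_3_UPPER_5 = 7.815                             # upper 5% point, 3 degrees of freedom
in_region = stat < CHI2_3_UPPER_5
```

Here `in_region` is true, so under these assumptions the rounded vector lies inside the 95% region for the first characteristic vector.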




This approach also provides a test of the null hypothesis that the $i$th
characteristic vector is a specified $\beta_0^{(i)}$ ($\beta_0^{(i)\prime}\beta_0^{(i)} = 1$). The hypothesis is
rejected if the right-hand side of (11) with $\beta^{(i)}$ replaced by $\beta_0^{(i)}$ exceeds
$\chi^2_{p-1}(\varepsilon)$.

Mallows (1961) suggested a test of whether some characteristic vector of
$\Sigma$ is $\beta_0$. Let $B_0$ be a $p \times (p-1)$ matrix such that $B_0'\beta_0 = 0$. If the null
hypothesis is true, $\beta_0'X$ and $B_0'X$ are independent (because $B_0$ is a nonsingular
transform of the set of other characteristic vectors). The test is based on
the multiple correlation between $\beta_0'X$ and $B_0'X$. In principle, the test
procedure can be inverted to obtain a confidence region. The usefulness of
these procedures is limited by the fact that the hypothesized vector is not
attached to a characteristic root; the interpretation depends on the root (e.g.,
largest versus smallest).

Tyler (1981), (1983b) has generalized the confidence region (11) to include
the vectors in a linear subspace. He has also studied easing the restrictions of
a normally distributed parent population.


11.6.3. Exact Confidence Limits on the Characteristic Roots 


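We now consider a confidence interval for the entire set of characteristic
roots of $\Sigma$, namely, $\lambda_1 \ge \cdots \ge \lambda_p$ [Anderson (1965a)]. We use the facts that
$\beta^{(i)\prime}\Sigma\beta^{(i)} = \lambda_i$, $\beta^{(i)\prime}\beta^{(i)} = 1$, $i = 1, p$, and $\beta^{(1)\prime}\Sigma\beta^{(p)} = 0 = \beta^{(1)\prime}\beta^{(p)}$. Then
$\beta^{(1)\prime}X$ and $\beta^{(p)\prime}X$ are uncorrelated and have variances $\lambda_1$ and $\lambda_p$, respectively.
Hence $n\beta^{(1)\prime}S\beta^{(1)}/\lambda_1$ and $n\beta^{(p)\prime}S\beta^{(p)}/\lambda_p$ are independently distributed
as $\chi^2$ with $n$ degrees of freedom. Let $l$ and $u$ be two numbers such
that

$$1 - \varepsilon = \Pr\{nl \le \chi_n^2\}\,\Pr\{\chi_n^2 \le nu\}. \tag{12}$$

Then

$$\begin{aligned} 1 - \varepsilon &= \Pr\left\{ l \le \frac{\beta^{(1)\prime}S\beta^{(1)}}{\lambda_1},\ \frac{\beta^{(p)\prime}S\beta^{(p)}}{\lambda_p} \le u \right\} \\ &= \Pr\left\{ \frac{\beta^{(p)\prime}S\beta^{(p)}}{u} \le \lambda_p,\ \lambda_1 \le \frac{\beta^{(1)\prime}S\beta^{(1)}}{l} \right\} \\ &\le \Pr\left\{ \frac{\min_{b'b=1} b'Sb}{u} \le \lambda_p,\ \lambda_1 \le \frac{\max_{b'b=1} b'Sb}{l} \right\} \\ &= \Pr\left\{ \frac{l_p}{u} \le \lambda_p \le \lambda_1 \le \frac{l_1}{l} \right\}. \end{aligned} \tag{13}$$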




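Theorem 11.6.1. A confidence interval for the characteristic roots of $\Sigma$
with confidence at least $1 - \varepsilon$ is

$$\frac{l_p}{u} \le \lambda_p \le \lambda_1 \le \frac{l_1}{l}, \tag{14}$$

where $l$ and $u$ satisfy (12).

A tighter inequality can lead to a better lower bound. The matrix $H =
nB'SB$ has characteristic roots $nl_1, \ldots, nl_p$ because $B$ is orthogonal. We use
the following lemma.

Lemma 11.6.1. For any positive definite matrix $H$

$$\mathrm{ch}_p(H) \le \frac{1}{h^{ii}} \le \mathrm{ch}_1(H), \qquad i = 1, \ldots, p, \tag{15}$$

where $H^{-1} = (h^{ij})$ and $\mathrm{ch}_p(H)$ and $\mathrm{ch}_1(H)$ are the minimum and maximum
characteristic roots of $H$, respectively.

Proof. From Theorem A.2.4 in the Appendix we have $\mathrm{ch}_p(H) \le h_{ii} \le
\mathrm{ch}_1(H)$ and

$$\mathrm{ch}_p(H^{-1}) \le h^{ii} \le \mathrm{ch}_1(H^{-1}), \qquad i = 1, \ldots, p. \tag{16}$$

Since $\mathrm{ch}_p(H) = 1/\mathrm{ch}_1(H^{-1})$ and $\mathrm{ch}_1(H) = 1/\mathrm{ch}_p(H^{-1})$, the lemma follows. ∎

The argument for Theorem 5.2.2 shows that $1/(\lambda_p h^{pp})$ is distributed as
$\chi^2$ with $n - p + 1$ degrees of freedom, and Theorem 4.3.3 shows that $h^{pp}$ is
independent of $h_{11}$. Let $l'$ and $u'$ be two numbers such that

$$1 - \varepsilon = \Pr\{nl' \le \chi_n^2\}\,\Pr\{\chi_{n-p+1}^2 \le nu'\}. \tag{17}$$

Then

$$1 - \varepsilon = \Pr\left\{ nl' \le \frac{h_{11}}{\lambda_1},\ \frac{1}{\lambda_p h^{pp}} \le nu' \right\} \le \Pr\left\{ \frac{l_p}{u'} \le \lambda_p \le \lambda_1 \le \frac{l_1}{l'} \right\}, \tag{18}$$

since $\mathrm{ch}_1(H) = nl_1$ and $\mathrm{ch}_p(H) = nl_p$.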




Theorem 11.6.2. A confidence interval for the characteristic roots of $\Sigma$
with confidence at least $1 - \varepsilon$ is

$$\frac{l_p}{u'} \le \lambda_p \le \lambda_1 \le \frac{l_1}{l'}, \tag{19}$$

where $l'$ and $u'$ satisfy (17).

Anderson (1965a, 1965b) showed that the above confidence bounds are
optimal within the class of bounds

$$f(l_1, \ldots, l_p) \le \lambda_p \le \lambda_1 \le g(l_1, \ldots, l_p), \tag{20}$$

where $f$ and $g$ are homogeneous of degree 1 and are monotonically nondecreasing
in each argument for fixed values of the others. If (20) holds with
probability at least $1 - \varepsilon$, then a pair of numbers $u'$ and $l'$ can be found to
satisfy (17) and

$$f(l_1, \ldots, l_p) \le \frac{l_p}{u'}, \qquad \frac{l_1}{l'} \le g(l_1, \ldots, l_p). \tag{21}$$

The homogeneity condition means that the confidence bounds are multiplied
by $c^2$ if the observed vectors are multiplied by $c$ (which is a kind of scale
invariance). The monotonicity conditions imply that an increase in the size of
$S$ results in an increase in the limits for $\Sigma$ (which is a kind of consistency).

The confidence bounds given in (31) of Section 10.8 for the roots of $\Sigma$
based on the distribution of the roots of $S$ when $\Sigma = I$ are greater.


11.7. TESTING HYPOTHESES ABOUT THE CHARACTERISTIC 
ROOTS OF A COVARIANCE MATRIX 


11.7.1. Testing a Hypothesis about the Sum of the Smallest 
Characteristic Roots 


An investigator may raise the question whether the last $p - m$ principal
components may be ignored, that is, whether the first $m$ principal components
furnish a good approximation to $X$. He may want to do this if the sum
of the variances of the last principal components is less than some specified
amount, say $\gamma$. Consider the null hypothesis

$$H:\ \lambda_{m+1} + \cdots + \lambda_p \ge \gamma, \tag{1}$$

where $\gamma$ is specified, against the alternative that the sum is less than $\gamma$. If the
characteristic roots of $\Sigma$ are different, it follows from Theorem 13.5.1 that

$$\sqrt{n}\left( \sum_{i=m+1}^{p} l_i - \sum_{i=m+1}^{p} \lambda_i \right) \tag{2}$$

has a limiting normal distribution with mean 0 and variance $2\sum_{i=m+1}^{p} \lambda_i^2$. The
variance can be consistently estimated by $2\sum_{i=m+1}^{p} l_i^2$. Then a rejection region
with (large-sample) significance level $\varepsilon$ is

$$\sum_{i=m+1}^{p} l_i < \gamma - \frac{\sqrt{2\sum_{i=m+1}^{p} l_i^2}}{\sqrt{n}}\, z(2\varepsilon), \tag{3}$$

where $z(2\varepsilon)$ is the upper significance point of the standard normal distribution
for significance level $\varepsilon$. The (large-sample) probability of rejection is $\varepsilon$ if
equality holds in (1) and is less than $\varepsilon$ if inequality holds.

The investigator may alternatively want an upper confidence interval for
$\sum_{i=m+1}^{p} \lambda_i$ with at least approximate confidence level $1 - \varepsilon$. It is

$$\sum_{i=m+1}^{p} \lambda_i \le \sum_{i=m+1}^{p} l_i + \frac{\sqrt{2\sum_{i=m+1}^{p} l_i^2}}{\sqrt{n}}\, z(2\varepsilon). \tag{4}$$

If the right-hand side is sufficiently small (in particular less than $\gamma$), the
investigator has confidence that the sum of the variances of the smallest
$p - m$ principal components is so small they can be neglected. Anderson
(1963a) gave this analysis also in the case that $\lambda_{m+1} = \cdots = \lambda_p$.
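
For illustration, the bound (4) can be computed for the Iris versicolor roots of Section 11.5. The function name, the choice $m = 2$, and the level $\varepsilon = 0.05$ are assumptions made for this sketch.

```python
from math import sqrt
from statistics import NormalDist

def upper_bound_sum_smallest(roots, m, n, eps):
    """Right-hand side of (4): approximate upper confidence bound for the
    sum of the p - m smallest population roots."""
    tail = roots[m:]
    z = NormalDist().inv_cdf(1.0 - eps)   # upper eps point, z(2 eps) in the text
    return sum(tail) + sqrt(2.0 * sum(l * l for l in tail)) / sqrt(n) * z

roots = [0.4879, 0.0724, 0.0548, 0.0098]   # (6) of Section 11.5
bound = upper_bound_sum_smallest(roots, m=2, n=49, eps=0.05)
```

The investigator would compare `bound` with the specified amount $\gamma$.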


11.7.2. Testing a Hypothesis about the Sum of the Smallest 
Characteristic Roots Relative to the Sum of All the Roots 


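The investigator may want to ignore the last $p - m$ principal components if
their sum is small relative to the sum of all the roots (which is the trace of the
covariance matrix). Consider the null hypothesis

$$H:\ f(\Lambda) = \frac{\lambda_{m+1} + \cdots + \lambda_p}{\lambda_1 + \cdots + \lambda_p} \ge \delta, \tag{5}$$

where $\delta$ is specified, against the alternative that $f(\Lambda) < \delta$. We use the fact
that

$$\begin{aligned} \frac{\partial f(\Lambda)}{\partial \lambda_i} &= -\frac{\lambda_{m+1} + \cdots + \lambda_p}{(\lambda_1 + \cdots + \lambda_p)^2}, & i &= 1, \ldots, m, \\ \frac{\partial f(\Lambda)}{\partial \lambda_i} &= \frac{\lambda_1 + \cdots + \lambda_m}{(\lambda_1 + \cdots + \lambda_p)^2}, & i &= m+1, \ldots, p. \end{aligned} \tag{6}$$

Then the asymptotic variance of $f(L)$ is

$$2\left(\frac{\delta}{\operatorname{tr}\Sigma}\right)^2 (\lambda_1^2 + \cdots + \lambda_m^2) + 2\left(\frac{1 - \delta}{\operatorname{tr}\Sigma}\right)^2 (\lambda_{m+1}^2 + \cdots + \lambda_p^2) \tag{7}$$

when equality holds in (5), by Theorem 4.2.3. The null hypothesis $H$ is
rejected if $\sqrt{n}\,[f(L) - \delta]$ is less than the appropriate significance point of the
standard normal distribution times the square root of (7) with $\lambda$'s replaced by
$l$'s and $\operatorname{tr}\Sigma$ by $\operatorname{tr} S$. Alternatively one can construct a large-sample confidence
region for $f(\Lambda)$. A confidence region of approximate confidence $1 - \varepsilon$ is
$[z = z(2\varepsilon)]$

$$\frac{\sum_{i=m+1}^{p} \lambda_i}{\sum_{i=1}^{p} \lambda_i} \le \frac{\sum_{i=m+1}^{p} l_i}{\sum_{i=1}^{p} l_i} + \frac{z\,\sqrt{2}\,\Bigl[\bigl(\sum_{i=m+1}^{p} l_i\bigr)^2 \sum_{i=1}^{m} l_i^2 + \bigl(\sum_{i=1}^{m} l_i\bigr)^2 \sum_{i=m+1}^{p} l_i^2\Bigr]^{1/2}}{\sqrt{n}\,\bigl(\sum_{i=1}^{p} l_i\bigr)^2}. \tag{8}$$

If the right-hand side is sufficiently small, the investigator may be willing to
let the first principal components represent the entire vector of measurements.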
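
The bound (8) can be evaluated in the same way. The sketch below again uses the roots of Section 11.5 with the illustrative choices $m = 2$ and $\varepsilon = 0.05$; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def ratio_upper_bound(roots, m, n, eps):
    """Return (f(L), right-hand side of (8)) for the share of total variance
    carried by the p - m smallest roots."""
    head, tail = roots[:m], roots[m:]
    total = sum(roots)
    f = sum(tail) / total
    # Estimated asymptotic variance of sqrt(n) f(L), with l's in place of lambda's.
    var = 2.0 * (sum(tail) ** 2 * sum(l * l for l in head)
                 + sum(head) ** 2 * sum(l * l for l in tail)) / total ** 4
    z = NormalDist().inv_cdf(1.0 - eps)   # upper eps point, z(2 eps) in the text
    return f, f + z * sqrt(var / n)

roots = [0.4879, 0.0724, 0.0548, 0.0098]   # (6) of Section 11.5
f_hat, f_bound = ratio_upper_bound(roots, m=2, n=49, eps=0.05)
```

Here `f_hat` is the observed share of the last two components and `f_bound` the approximate 95% upper limit for the population share.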


11.7.3. Testing Equality of the Smallest Roots 


Suppose the observed $X$ is given by $V + U + \mu$, where $V$ and $U$ are unobservable
random vectors with means 0 and $\mu$ is an unobservable vector of
constants. If $\mathcal{E}UU' = \sigma^2 I$, then $U$ can be interpreted as composed of errors
of measurement: uncorrelated components with equal variances. (It is
assumed that all components of $X$ are in the same units.) Then $V$ can be
interpreted as made up of the systematic parts and is supposed to lie in an
$m$-dimensional space. Then $\mathcal{E}VV' = \Phi$ is positive semidefinite of rank $m$.
The observable covariance matrix $\Sigma = \Phi + \sigma^2 I$ has a characteristic root of
$\sigma^2$ with multiplicity $p - m$ (Problem 11.4).

In this subsection we consider testing the null hypothesis that $\lambda_{m+1} = \cdots
= \lambda_p$. That is equivalent to the null hypothesis that $\Sigma = \Phi + \sigma^2 I$, where $\Phi$ is
positive semidefinite of rank $m$. In Section 10.7, we saw that when $m = 0$, the
likelihood ratio criterion was the $\tfrac{1}{2}pN$th power of the ratio of the geometric
mean to the arithmetic mean of the sample roots. The analogous criterion
here is the $\tfrac{1}{2}N$th power of

$$\frac{\prod_{i=m+1}^{p} l_i}{\Bigl[\sum_{i=m+1}^{p} l_i / (p - m)\Bigr]^{p-m}}. \tag{9}$$

It is also the likelihood ratio criterion, but we shall not derive it. [See
Anderson (1963a).] Let $\sqrt{n}\,(l_i - \lambda_{m+1}) = d_i$, $i = m+1, \ldots, p$. The logarithm
of (9) multiplied by $-n$ is asymptotically equivalent under the null hypothesis
to

$$\begin{aligned} &-n \log \prod_{i=m+1}^{p} l_i + n(p - m) \log \frac{\sum_{i=m+1}^{p} l_i}{p - m} \\ &\quad= -n \sum_{i=m+1}^{p} \log(\lambda_{m+1} + n^{-1/2} d_i) + n(p - m) \log \frac{\sum_{i=m+1}^{p} (\lambda_{m+1} + n^{-1/2} d_i)}{p - m} \\ &\quad= n \left\{ -\sum_{i=m+1}^{p} \log\left(1 + \frac{d_i}{\lambda_{m+1} n^{1/2}}\right) + (p - m) \log\left(1 + \frac{\sum_{i=m+1}^{p} d_i}{(p - m)\lambda_{m+1} n^{1/2}}\right) \right\} \\ &\quad= n \left\{ -\sum_{i=m+1}^{p} \left[ \frac{d_i}{\lambda_{m+1} n^{1/2}} - \frac{d_i^2}{2\lambda_{m+1}^2 n} + \cdots \right] + (p - m) \left[ \frac{\sum_{i=m+1}^{p} d_i}{(p - m)\lambda_{m+1} n^{1/2}} - \frac{\bigl(\sum_{i=m+1}^{p} d_i\bigr)^2}{2(p - m)^2 \lambda_{m+1}^2 n} + \cdots \right] \right\} \\ &\quad= \frac{1}{2\lambda_{m+1}^2} \left[ \sum_{i=m+1}^{p} d_i^2 - \frac{1}{p - m} \Bigl( \sum_{i=m+1}^{p} d_i \Bigr)^2 \right] + o_p(1). \end{aligned} \tag{10}$$

It is shown in Section 13.5.2 that the limiting distribution of $d_{m+1}, \ldots, d_p$ is
the same as the distribution of the roots of a symmetric matrix $U_{22} = (u_{ij})$,
$i, j = m+1, \ldots, p$, whose functionally independent elements are independent
and normal with mean 0; an off-diagonal element $u_{ij}$, $i < j$, has variance
$\lambda_{m+1}^2$, and a diagonal element $u_{ii}$ has variance $2\lambda_{m+1}^2$. See Theorem 13.5.2.

Then (10) has the limiting distribution of

$$\frac{1}{2\lambda_{m+1}^2} \left[ \operatorname{tr} U_{22}^2 - \frac{1}{p - m} (\operatorname{tr} U_{22})^2 \right]. \tag{11}$$

Thus $\sum_{i<j} u_{ij}^2 / \lambda_{m+1}^2$ is asymptotically $\chi^2$ with $\tfrac{1}{2}(p - m)(p - m - 1)$ degrees
of freedom; $\tfrac{1}{2}\bigl[\sum_{i=m+1}^{p} u_{ii}^2 - \bigl(\sum_{i=m+1}^{p} u_{ii}\bigr)^2 / (p - m)\bigr] / \lambda_{m+1}^2$ is asymptotically $\chi^2$
with $p - m - 1$ degrees of freedom. Then (10) has a limiting $\chi^2$-distribution
with $\tfrac{1}{2}(p - m + 2)(p - m - 1)$ degrees of freedom. The hypothesis is rejected
if the left-hand side of (10) is greater than the upper-tailed significance point
of the $\chi^2$-distribution. If the hypothesis is not rejected, the investigator may
consider the last $p - m$ principal components to be composed entirely of
error.
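
As a sketch of how the test might be carried out, the following computes the left-hand side of (10) and its degrees of freedom for the two smallest roots of the Iris versicolor example in Section 11.5. The function name, the choice $m = 2$, and the level $\varepsilon = 0.05$ are assumptions. With 2 degrees of freedom the upper $\varepsilon$ point of $\chi^2$ is $-2\log\varepsilon$, which avoids any table lookup in this particular case.

```python
from math import log

def equal_smallest_roots_statistic(roots, m, n):
    """Return (-n log of criterion (9), degrees of freedom) for testing
    equality of the p - m smallest roots."""
    tail = roots[m:]
    q = len(tail)                                   # q = p - m
    mean = sum(tail) / q
    log_criterion = sum(log(l) for l in tail) - q * log(mean)
    degrees = (q + 2) * (q - 1) // 2
    return -n * log_criterion, degrees

roots = [0.4879, 0.0724, 0.0548, 0.0098]            # (6) of Section 11.5
stat, df = equal_smallest_roots_statistic(roots, m=2, n=49)

eps = 0.05
critical = -2.0 * log(eps)   # exact upper eps point of chi-square with 2 d.f. only
reject = stat > critical
```

In this example `reject` is true: the two smallest sample roots differ by more than the hypothesis of equal population roots would allow at this level.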

When the units of measurement are not all the same, the three hypotheses
considered in Section 11.7 have questionable meaning. Corresponding
hypotheses for the correlation matrix also have doubtful interpretation.
Moreover, the last criterion does not have (usually) a $\chi^2$-distribution. More
discussion is given by Anderson (1963a).

The criterion (9) corresponds to the sphericity criterion of Section 10.5,
and the number of degrees of freedom of the corresponding $\chi^2$-distribution
is $\tfrac{1}{2}(p - m)(p - m + 1) - 1$.


11.8. ELLIPTICALLY CONTOURED DISTRIBUTIONS


11.8.1. Observations Elliptically Contoured 


Let $x_1, \ldots, x_N$ be $N$ observations on a random vector $X$ with density

$$|\Psi|^{-1/2}\, g[(x - \nu)'\Psi^{-1}(x - \nu)], \tag{1}$$

where $\Psi$ is a positive definite matrix, $R^2 = (X - \nu)'\Psi^{-1}(X - \nu)$, and
$\mathcal{E}R^4 < \infty$. Define $\kappa = p\,\mathcal{E}R^4 / [(\mathcal{E}R^2)^2 (p + 2)] - 1$. Then $\mathcal{E}X = \nu = \mu$ and
$\mathcal{E}(X - \nu)(X - \nu)' = (\mathcal{E}R^2/p)\Psi = \Sigma$.

The maximum likelihood estimators of the principal components of $\Sigma$ are
the characteristic roots and vectors of $\hat{\Sigma} = (\mathcal{E}R^2/p)\hat{\Psi}$ given by (20) of
Section 3.6. Alternative estimators are the characteristic roots and vectors of
$S$, the unbiased estimator of $\Sigma$. The asymptotic normal distributions of these
estimators are derived in Section 13.7 (Theorem 13.7.1). Let $\Sigma = \mathbf{B}\Lambda\mathbf{B}'$ and
$S = BLB'$, where $\Lambda$ and $L$ are diagonal and $\mathbf{B}$ and $B$ are orthogonal. Let
$D = \sqrt{N}\,(L - \Lambda)$ and $G = \sqrt{N}\,(B - \mathbf{B})$. Then the limiting distribution of $D$
and $G$ is normal with $G$ and $D$ independent.

The variance of $d_i$ is $(2 + 3\kappa)\lambda_i^2$, and the covariance of $d_i$ and $d_j$ ($i \ne j$) is
$\kappa\lambda_i\lambda_j$. The covariance of $g_i$ is

$$\mathcal{AC}(g_i) = (1 + \kappa) \sum_{k \ne i} \frac{\lambda_i\lambda_k}{(\lambda_i - \lambda_k)^2}\, \beta_k \beta_k'. \tag{2}$$

The covariance of $g_i$ and $g_j$ is

$$\mathcal{AC}(g_i, g_j) = -(1 + \kappa)\, \frac{\lambda_i\lambda_j}{(\lambda_i - \lambda_j)^2}\, \beta_j \beta_i', \qquad i \ne j. \tag{3}$$

For inference about a single ordered root $\lambda_i$ the limiting standard normal
distribution of $\sqrt{N}\,(l_i - \lambda_i) / (\sqrt{2 + 3\kappa}\; l_i)$ can be used.

For inference about a single vector the right-hand side of (11) in Section
11.6.2 can be used with $S$ replaced by $(1 + \kappa)S$ and $S^{-1}$ by $S^{-1}/(1 + \kappa)$.

It is shown in Section 13.7.1 that the limiting distribution of the logarithm
of the likelihood ratio criterion for testing the equality of the $q = p - m$
smallest roots is the distribution of $(1 + \kappa)\chi^2_{q(q+1)/2 - 1}$.

11.8.2. Elliptically Contoured Matrix Distributions 
Suppose the density of $X = (x_1, \ldots, x_N)$ is

$$|\Psi|^{-N/2}\, g[\operatorname{tr}(X - \nu\varepsilon_N')'\Psi^{-1}(X - \nu\varepsilon_N')] = |\Psi|^{-N/2}\, g[\operatorname{tr} A\Psi^{-1} + N(\bar{x} - \nu)'\Psi^{-1}(\bar{x} - \nu)],$$

where $A = (X - \bar{x}\varepsilon_N')(X - \bar{x}\varepsilon_N')' = nS$ and $n = N - 1$. Thus $\bar{x}$ and $A$ are a
sufficient set of statistics.

Now consider $A = YY'$ having the density $g(\operatorname{tr} A)$. Let $A = BLB'$, where $L$
is diagonal with diagonal elements $l_1 > \cdots > l_p$ and $B$ is orthogonal with
$b_{1i} \ge 0$. Then $L$ and $B$ are independent; the roots $l_1, \ldots, l_p$ have the density
(18) of Section 13.7, and the matrix $B$ has the conditional Haar invariant
distribution.


PROBLEMS 


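11.1. (Sec. 11.2) Prove that the characteristic vectors of $\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$ are

$$\begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix},$$

corresponding to roots $1 + \rho$ and $1 - \rho$.


11.2. (Sec. 11.2) Verify that the proof of Theorem 11.2.1 yields a proof of Theorem
A.2.1 of the Appendix for any real symmetric matrix.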


11.3. (Sec. 11.2) Let $z = y + x$, where $\mathcal{E}y = \mathcal{E}x = 0$, $\mathcal{E}yy' = \Phi$, $\mathcal{E}xx' = \sigma^2 I$, $\mathcal{E}yx'
= 0$. The $p$ components of $y$ can be called systematic parts, and the components
of $x$ errors.

(a) Find the linear combination $\gamma'z$ of unit variance that has minimum error
variance (i.e., $\gamma'x$ has minimum variance).

(b) Suppose $\phi_{ii} + \sigma^2 = 1$, $i = 1, \ldots, p$. Find the linear function $\gamma'z$ of unit
variance that maximizes the sum of squares of the correlations between $z_i$
and $\gamma'z$, $i = 1, \ldots, p$.

(c) Relate these results to principal components.


11.4. (Sec. 11.2) Let $\Sigma = \Phi + \sigma^2 I$, where $\Phi$ is positive semidefinite of rank $m$.
Prove that each characteristic vector of $\Phi$ is a vector of $\Sigma$ and each root of $\Sigma$
is a root of $\Phi$ plus $\sigma^2$.


11.5. (Sec. 11.2) Let the characteristic roots of $\Sigma$ be $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$.

(a) What is the form of $\Sigma$ if $\lambda_1 = \lambda_2 = \cdots = \lambda_p > 0$? What is the shape of an
ellipsoid of constant density?

(b) What is the form of $\Sigma$ if $\lambda_1 > \lambda_2 = \cdots = \lambda_p > 0$? What is the shape of an
ellipsoid of constant density?

(c) What is the form of $\Sigma$ if $\lambda_1 = \cdots = \lambda_{p-1} > \lambda_p > 0$? What is the shape of
the ellipsoid of constant density?


11.6. (Sec. 11.2) Intraclass correlation. Let

$$\Sigma = \sigma^2[(1 - \rho)I + \rho\,\varepsilon\varepsilon'],$$

where $\varepsilon = (1, \ldots, 1)'$. Show that for $\rho > 0$, the largest characteristic root is
$\sigma^2[1 + (p - 1)\rho]$ and the corresponding characteristic vector is $\varepsilon$. Show that if
$\varepsilon'x = 0$, then $x$ is a characteristic vector corresponding to the root $\sigma^2(1 - \rho)$.
Show that the root $\sigma^2(1 - \rho)$ has multiplicity $p - 1$.


11.7. (Sec. 11.3) In the example of Section 9.6, consider the three pressing operations
$(x_2, x_4, x_5)$. Find the first principal component of this estimated covariance
matrix. [Hint: Start with the vector $(1, 1, 1)$ and iterate.]


11.8. (Sec. 11.3) Prove directly the sample analog of Theorem 11.2.1, where $\sum x_\alpha =
0$, $\sum x_\alpha x_\alpha' = A$.


11.9. (Sec. 11.3) Let $l_1$ and $l_p$ be the largest and smallest characteristic roots of $S$,
respectively. Prove $\mathcal{E}l_1 \ge \lambda_1$ and $\mathcal{E}l_p \le \lambda_p$.


11.10. (Sec. 11.3) Let $U_1 = \beta^{(1)\prime}X$ be the first population principal component with
variance $\mathcal{V}(U_1) = \lambda_1$, and let $V_1 = b^{(1)\prime}X$ be the first sample principal component
with (sample) variance $l_1$ (based on $S$). Let $S^*$ be the covariance matrix
of a second (independent) sample. Show $\mathcal{E}b^{(1)\prime}S^*b^{(1)} \le \lambda_1$.


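11.11. (Sec. 11.3) Suppose that $\sigma_{ij} > 0$ for every $i, j$ [$\Sigma = (\sigma_{ij})$]. Show that (a) the
coefficients of the first principal component are all of the same sign, and
(b) the coefficients of each other principal component cannot be all of the
same sign.


11.12. (Sec. 11.4) Prove (4) when $\lambda_1 > \lambda_2$.
(a) Show $\Sigma^i = B\Lambda^i B'$.
(b) Show

$$y_{(i)} = l_i B\Lambda^i B' x_{(0)} = \lambda_1^i\, l_i\, B\left(\frac{1}{\lambda_1}\Lambda\right)^{i} B' x_{(0)},$$

where $l_i = \prod_{j=0}^{i} s_j$ and $s_j = 1/\sqrt{x_{(j)}' x_{(j)}}$.
(c) Show

$$\lim_{i \to \infty} \left(\frac{1}{\lambda_1}\Lambda\right)^{i} = E_{11},$$

where $E_{11}$ has 1 in the upper left-hand position and 0's elsewhere.
(d) Show $\lim_{i \to \infty} (l_i \lambda_1^i)^2 = 1/(\beta^{(1)\prime} x_{(0)})^2$.
(e) Conclude the proof.


11.13. (Sec. 11.4) Let

$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{(1)}' \\ \sigma_{(1)} & \Sigma_{22} \end{pmatrix}, \qquad \begin{pmatrix} 1 & 0' \\ 0 & H \end{pmatrix},$$

where $H = I_{p-1} - 2\alpha\alpha'$ and $\alpha$ has $p - 1$ components. Show that $\alpha$ can be
chosen so that in

$$\begin{pmatrix} 1 & 0' \\ 0 & H \end{pmatrix} \Sigma \begin{pmatrix} 1 & 0' \\ 0 & H \end{pmatrix} = \begin{pmatrix} \sigma_{11} & \sigma_{(1)}' H \\ H\sigma_{(1)} & H\Sigma_{22}H \end{pmatrix}$$

$H\sigma_{(1)}$ has all 0 components except the first.


11.14. (Sec. 11.6) Show that

$$\log l_i - \sqrt{\frac{2}{n}}\, z(\varepsilon) \le \log \lambda_i \le \log l_i + \sqrt{\frac{2}{n}}\, z(\varepsilon)$$

is a confidence interval for $\log \lambda_i$ with approximate confidence $1 - \varepsilon$.


11.15. (Sec. 11.6) Prove that $u' < u$ if $l' = l$ and $p > 2$.


11.16. (Sec. 11.6) Prove that $u < u^*$ if $l = l^*$ and $p > 2$, where $l^*$ and $u^*$ are the $l$
and $u$ of Section 10.8.4.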



11.17. The lengths, widths, and heights (in millimeters) of 24 male painted turtles
[Jolicoeur and Mosimann (1960)] are given below. Find the (sample) principal
components and their variances.


Case                            Case
No.   Length   Width   Height   No.   Length   Width   Height
 1      93       74      37     13     116       90      43
 2      94       78      35     14     117       90      41
 3      96       80      35     15     117       91      41
 4     101       84      39     16     119       93      41
 5     102       85      38     17     120       89      40
 6     103       81      37     18     120       93      44
 7     104       83      39     19     121       95      42
 8     106       83      39     20     125       93      45
 9     107       82      38     21     127       96      45
10     112       89      40     22     128       95      46
11     113       88      40     23     131       95      46
12     114       86      40     24     135      106      47


CHAPTER 12 


Canonical Correlations 
and Canonical Variables 


12.1. INTRODUCTION 


In this section we consider two sets of variates with a joint distribution, and 
we analyze the correlations between the variables of one set and those of the 
other set. We find a new coordinate system in the space of each set of 
variates in such a way that the new coordinates display unambiguously the 
system of correlation. More precisely, we find linear combinations of vari- 
ables in the sets that have maximum correlation; these linear combinations 
are the first coordinates in the new systems. Then a second linear combination
in each set is sought such that the correlation between these is the
maximum of correlations between such linear combinations as are uncorre- 
lated with the first linear combinations. The procedure is continued until the 
two new coordinate systems are completely specified. 

The statistical method outlined is of particular usefulness in exploratory 
studies. The investigator may have two large sets of variates and may want 
to study the interrelations. If the two sets are very large, he may want 
to consider only a few linear combinations of each set. Then he will want to 
study those linear combinations most highly correlated. For example, one set
of variables may be measurements of physical characteristics, such as various
lengths and breadths of skulls; the other variables may be measurements of
mental characteristics, such as scores on intelligence tests. If the investigator
is interested in relating these, he may find that the interrelation is almost


An Introduction to Multivariate Statistical Analysis, Third Edition. By T, W. Anderson 
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc, 




completely described by the correlation between the first few canonical 
variates. 

The basic theory was developed by Hotelling (1935), (1936). 

In Section 12.2 the canonical correlations and variates in the population 
are defined; they imply a linear transformation to canonical form. Maximum 
likelihood estimators are sample analogs. Tests of independence and of the 
rank of a correlation matrix are developed on the basis of asymptotic theory 
in Section 12.4. 

Another formulation of canonical correlations and variates is made in the 
case of one set being random and the other set consisting of nonstochastic 
variables; the expected values of the random variables are linear combina- 
tions of the nonstochastic variables (Section 12.6). This is the model of 
Section 8.2. One set of canonical variables consists of linear combinations of 
the random variables and the other set consists of the nonstochastic variables;
the effect of the regression of a member of the first set on a member of
the second is maximized. Linear functional relationships are studied in this 
framework. 

Simultaneous equations models are studied in Section 12.7. Estimation of 
a single equation in this model is formally identical to estimation of a single 
linear functional relationship. The limited-information maximum likelihood 
estimator and the two-stage least squares estimator are developed. 


12.2. CANONICAL CORRELATIONS AND VARIATES
IN THE POPULATION


Suppose the random vector $X$ of $p$ components has the covariance matrix $\Sigma$
(which is assumed to be positive definite). Since we are only interested in
variances and covariances in this chapter, we shall assume $\mathcal{E}X = 0$ when
treating the population. In developing the concepts and algebra we do not
need to assume that $X$ is normally distributed, though this latter assumption
will be made to develop sampling theory.

We partition $X$ into two subvectors of $p_1$ and $p_2$ components, respectively,

$$X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}. \tag{1}$$

For convenience we shall assume $p_1 \le p_2$. The covariance matrix is partitioned
similarly into $p_1$ and $p_2$ rows and columns,

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}. \tag{2}$$



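In the previous chapter we developed a rotation of coordinate axes to a new
system in which the variance properties were clearly exhibited. Here we shall
develop a transformation of the first $p_1$ coordinate axes and a transformation
of the last $p_2$ coordinate axes to a new $(p_1 + p_2)$-system that will exhibit
clearly the intercorrelations between $X^{(1)}$ and $X^{(2)}$.

Consider an arbitrary linear combination, $U = \alpha'X^{(1)}$, of the components
of $X^{(1)}$, and an arbitrary linear function, $V = \gamma'X^{(2)}$, of the components of
$X^{(2)}$. We first ask for the linear functions that have maximum correlation.
Since the correlation of a multiple of $U$ and a multiple of $V$ is the same as
the correlation of $U$ and $V$, we can make an arbitrary normalization of $\alpha$ and
$\gamma$. We therefore require $\alpha$ and $\gamma$ to be such that $U$ and $V$ have unit
variance, that is,

$$1 = \mathcal{E}U^2 = \mathcal{E}\alpha'X^{(1)}X^{(1)\prime}\alpha = \alpha'\Sigma_{11}\alpha, \tag{3}$$

$$1 = \mathcal{E}V^2 = \mathcal{E}\gamma'X^{(2)}X^{(2)\prime}\gamma = \gamma'\Sigma_{22}\gamma. \tag{4}$$

We note that $\mathcal{E}U = \mathcal{E}\alpha'X^{(1)} = \alpha'\mathcal{E}X^{(1)} = 0$ and similarly $\mathcal{E}V = 0$. Then the
correlation between $U$ and $V$ is

$$\mathcal{E}UV = \mathcal{E}\alpha'X^{(1)}X^{(2)\prime}\gamma = \alpha'\Sigma_{12}\gamma. \tag{5}$$

Thus the algebraic problem is to find $\alpha$ and $\gamma$ to maximize (5) subject to (3)
and (4).

Let

$$\psi = \alpha'\Sigma_{12}\gamma - \tfrac{1}{2}\lambda(\alpha'\Sigma_{11}\alpha - 1) - \tfrac{1}{2}\mu(\gamma'\Sigma_{22}\gamma - 1), \tag{6}$$

where $\lambda$ and $\mu$ are Lagrange multipliers. We differentiate $\psi$ with respect to
the elements of $\alpha$ and $\gamma$. The vectors of derivatives set equal to zero are

$$\frac{\partial\psi}{\partial\alpha} = \Sigma_{12}\gamma - \lambda\Sigma_{11}\alpha = 0, \tag{7}$$

$$\frac{\partial\psi}{\partial\gamma} = \Sigma_{12}'\alpha - \mu\Sigma_{22}\gamma = 0. \tag{8}$$

Multiplication of (7) on the left by $\alpha'$ and (8) on the left by $\gamma'$ gives

$$\alpha'\Sigma_{12}\gamma - \lambda\alpha'\Sigma_{11}\alpha = 0, \tag{9}$$

$$\gamma'\Sigma_{12}'\alpha - \mu\gamma'\Sigma_{22}\gamma = 0. \tag{10}$$

Since $\alpha'\Sigma_{11}\alpha = 1$ and $\gamma'\Sigma_{22}\gamma = 1$, this shows that $\lambda = \mu = \alpha'\Sigma_{12}\gamma$. Thus (7)
and (8) can be written as

$$-\lambda\Sigma_{11}\alpha + \Sigma_{12}\gamma = 0, \tag{11}$$

$$\Sigma_{21}\alpha - \lambda\Sigma_{22}\gamma = 0, \tag{12}$$

since $\Sigma_{12}' = \Sigma_{21}$. In one matrix equation this is

$$\begin{pmatrix} -\lambda\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\lambda\Sigma_{22} \end{pmatrix} \begin{pmatrix} \alpha \\ \gamma \end{pmatrix} = 0. \tag{13}$$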

In order that there be a nontrivial solution [which is necessary for a solution
satisfying (3) and (4)], the matrix on the left must be singular; that is,

$$\begin{vmatrix} -\lambda\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\lambda\Sigma_{22} \end{vmatrix} = 0. \tag{14}$$

The determinant on the left is a polynomial of degree $p$. To demonstrate
this, consider a Laplace expansion by minors of the first $p_1$ columns. One
term is $|-\lambda\Sigma_{11}| \cdot |-\lambda\Sigma_{22}| = (-\lambda)^{p_1 + p_2} |\Sigma_{11}| \cdot |\Sigma_{22}|$. The other terms in the
expansion are of lower degree in $\lambda$ because one or more rows of each minor
in the first $p_1$ columns does not contain $\lambda$. Since $\Sigma$ is positive definite,
$|\Sigma_{11}| \cdot |\Sigma_{22}| \ne 0$ (Corollary A.1.3 of the Appendix). This shows that (14) is a
polynomial equation of degree $p$ and has $p$ roots, say $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$. [$\alpha'$
and $\gamma'$ complex conjugate in (9) and (10) prove $\lambda$ real.]

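From (9) we see that $\lambda = \alpha'\Sigma_{12}\gamma$ is the correlation
between $U = \alpha'X^{(1)}$ and $V = \gamma'X^{(2)}$ when $\alpha$ and $\gamma$
satisfy (13) for some value of $\lambda$. Since we want the maximum correlation,
we take $\lambda = \lambda_1$. Let a solution to (13) for $\lambda = \lambda_1$ be
$\alpha^{(1)}, \gamma^{(1)}$, and let $U_1 = \alpha^{(1)\prime}X^{(1)}$ and
$V_1 = \gamma^{(1)\prime}X^{(2)}$. Then $U_1$ and $V_1$ are normalized linear
combinations of $X^{(1)}$ and $X^{(2)}$, respectively, with maximum correlation.

We now consider finding a second linear combination of $X^{(1)}$, say
$U = \alpha'X^{(1)}$, and a second linear combination of $X^{(2)}$, say
$V = \gamma'X^{(2)}$, such that of all linear combinations uncorrelated with
$U_1$ and $V_1$ these have maximum correlation. This procedure is continued. At
the $r$th step we have obtained linear combinations $U_1 = \alpha^{(1)\prime}X^{(1)}$,
$V_1 = \gamma^{(1)\prime}X^{(2)}, \ldots, U_r = \alpha^{(r)\prime}X^{(1)}$,
$V_r = \gamma^{(r)\prime}X^{(2)}$ with corresponding correlations [roots of (14)]
$\lambda^{(1)} = \lambda_1, \lambda^{(2)}, \ldots, \lambda^{(r)}$. We ask for a
linear combination of $X^{(1)}$, $U = \alpha'X^{(1)}$, and a linear combination
of $X^{(2)}$, $V = \gamma'X^{(2)}$, that among all linear combinations
uncorrelated with $U_1, V_1, \ldots, U_r, V_r$ have maximum correlation. The
condition that $U$ be uncorrelated with $U_i$ is

(15) $0 = \mathscr{E}UU_i = \mathscr{E}\alpha'X^{(1)}X^{(1)\prime}\alpha^{(i)} = \alpha'\Sigma_{11}\alpha^{(i)}.$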


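Then

(16) $\mathscr{E}UV_i = \alpha'\Sigma_{12}\gamma^{(i)} = \lambda^{(i)}\alpha'\Sigma_{11}\alpha^{(i)} = 0.$

The condition that $V$ be uncorrelated with $V_i$ is

(17) $0 = \mathscr{E}VV_i = \gamma'\Sigma_{22}\gamma^{(i)}.$

By the same argument we have

(18) $\mathscr{E}VU_i = \gamma'\Sigma_{21}\alpha^{(i)} = 0.$

We now maximize $\mathscr{E}U_{r+1}V_{r+1}$, choosing $\alpha$ and $\gamma$ to
satisfy (3), (4), (15), and (17) for $i = 1, 2, \ldots, r$. Consider

(19) $\psi_{r+1} = \alpha'\Sigma_{12}\gamma - \tfrac{1}{2}\lambda(\alpha'\Sigma_{11}\alpha - 1) - \tfrac{1}{2}\mu(\gamma'\Sigma_{22}\gamma - 1) + \sum_{i=1}^{r} \nu_i\,\alpha'\Sigma_{11}\alpha^{(i)} + \sum_{i=1}^{r} \theta_i\,\gamma'\Sigma_{22}\gamma^{(i)},$

where $\lambda, \mu, \nu_1, \ldots, \nu_r, \theta_1, \ldots, \theta_r$ are Lagrange
multipliers. The vectors of partial derivatives of $\psi_{r+1}$ with respect to
the elements of $\alpha$ and $\gamma$ are set equal to zero, giving

(20) $\frac{\partial \psi_{r+1}}{\partial \alpha} = \Sigma_{12}\gamma - \lambda\Sigma_{11}\alpha + \sum_{i=1}^{r} \nu_i\Sigma_{11}\alpha^{(i)} = 0,$
(21) $\frac{\partial \psi_{r+1}}{\partial \gamma} = \Sigma_{21}\alpha - \mu\Sigma_{22}\gamma + \sum_{i=1}^{r} \theta_i\Sigma_{22}\gamma^{(i)} = 0.$

Multiplication of (20) on the left by $\alpha^{(j)\prime}$ and (21) on the left by
$\gamma^{(j)\prime}$ gives

(22) $0 = \nu_j\,\alpha^{(j)\prime}\Sigma_{11}\alpha^{(j)} = \nu_j,$
(23) $0 = \theta_j\,\gamma^{(j)\prime}\Sigma_{22}\gamma^{(j)} = \theta_j.$

Thus (20) and (21) are simply (11) and (12) or alternatively (13). We therefore
take the largest $\lambda_i$, say, $\lambda^{(r+1)}$, such that there is a solution
to (13) satisfying (3), (4), (15), and (17) for $i = 1, \ldots, r$. Let this
solution be $\alpha^{(r+1)}, \gamma^{(r+1)}$, and let
$U_{r+1} = \alpha^{(r+1)\prime}X^{(1)}$ and $V_{r+1} = \gamma^{(r+1)\prime}X^{(2)}$.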

This procedure is continued step by step as long as successive solutions
can be found which satisfy the conditions, namely, (13) for some $\lambda_i$, (3),
(4), (15), and (17). Let $m$ be the number of steps for which this can be done.
Now we shall show that $m = p_1$ $(\leq p_2)$. Let
$A = (\alpha^{(1)} \cdots \alpha^{(m)})$, $\Gamma_1 = (\gamma^{(1)} \cdots \gamma^{(m)})$, and

(24) $\Lambda = \begin{pmatrix} \lambda^{(1)} & 0 & \cdots & 0 \\ 0 & \lambda^{(2)} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda^{(m)} \end{pmatrix}.$

The conditions (3) and (15) can be summarized as

(25) $A'\Sigma_{11}A = I.$

Since $\Sigma_{11}$ is of rank $p_1$ and $I$ is of rank $m$, we have $m \leq p_1$.
Now let us show that $m < p_1$ leads to a contradiction by showing that in this
case there is another vector satisfying the conditions. Since $A'\Sigma_{11}$ is
$m \times p_1$, there exists a $p_1 \times (p_1 - m)$ matrix $E$ (of rank $p_1 - m$)
such that $A'\Sigma_{11}E = 0$. Similarly there is a $p_2 \times (p_2 - m)$ matrix
$F$ (of rank $p_2 - m$) such that $\Gamma_1'\Sigma_{22}F = 0$. We also have
$\Gamma_1'\Sigma_{21}E = \Lambda A'\Sigma_{11}E = 0$ and
$A'\Sigma_{12}F = \Lambda\Gamma_1'\Sigma_{22}F = 0$. Since $E$ is of rank $p_1 - m$,
$E'\Sigma_{11}E$ is nonsingular (if $m < p_1$), and similarly $F'\Sigma_{22}F$
is nonsingular. Thus there is at least one root of

(26) $\begin{vmatrix} -\nu E'\Sigma_{11}E & E'\Sigma_{12}F \\ F'\Sigma_{21}E & -\nu F'\Sigma_{22}F \end{vmatrix} = 0,$

because $|E'\Sigma_{11}E| \cdot |F'\Sigma_{22}F| \neq 0$. From the preceding algebra
we see that there exist vectors $a$ and $b$ such that

(27) $E'\Sigma_{12}Fb = \nu E'\Sigma_{11}Ea,$
(28) $F'\Sigma_{21}Ea = \nu F'\Sigma_{22}Fb.$

Let $Ea = g$ and $Fb = h$. We now want to show that $\nu$, $g$, and $h$ form a
new solution $\lambda^{(m+1)}, \alpha^{(m+1)}, \gamma^{(m+1)}$. Let
$\Sigma_{11}^{-1}\Sigma_{12}h = k$. Since $A'\Sigma_{11}k = A'\Sigma_{12}Fb = 0$,
$k$ is orthogonal to the rows of $A'\Sigma_{11}$ and therefore is a linear
combination of the columns of $E$, say $Ec$. Thus the equation
$\Sigma_{12}h = \Sigma_{11}k$ can be written

(29) $\Sigma_{12}Fb = \Sigma_{11}Ec.$

Multiplication by $E'$ on the left gives

(30) $E'\Sigma_{12}Fb = E'\Sigma_{11}Ec.$

Since $E'\Sigma_{11}E$ is nonsingular, comparison of (27) and (30) shows that
$c = \nu a$, and therefore $k = \nu g$. Thus

(31) $\Sigma_{12}h = \nu\Sigma_{11}g.$

In a similar fashion we show that

(32) $\Sigma_{21}g = \nu\Sigma_{22}h.$

Therefore $\nu = \lambda^{(m+1)}$, $g = \alpha^{(m+1)}$, $h = \gamma^{(m+1)}$ is
another solution. But this is contrary to the assumption that
$\lambda^{(m)}, \alpha^{(m)}, \gamma^{(m)}$ was the last possible solution.
Thus $m = p_1$.

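The conditions on the $\lambda$'s, $\alpha$'s, and $\gamma$'s can be summarized as

(33) $A'\Sigma_{11}A = I,$
(34) $A'\Sigma_{12}\Gamma_1 = \Lambda,$
(35) $\Gamma_1'\Sigma_{22}\Gamma_1 = I.$

Let $\Gamma_2 = (\gamma^{(p_1+1)} \cdots \gamma^{(p_2)})$ be a $p_2 \times (p_2 - p_1)$
matrix satisfying

(36) $\Gamma_2'\Sigma_{22}\Gamma_1 = 0,$
(37) $\Gamma_2'\Sigma_{22}\Gamma_2 = I.$

Any $\Gamma_2$ can be multiplied on the right by an arbitrary
$(p_2 - p_1) \times (p_2 - p_1)$ orthogonal matrix. This matrix can be formed one
column at a time: $\gamma^{(p_1+1)}$ is a vector orthogonal to $\Sigma_{22}\Gamma_1$
and normalized so $\gamma^{(p_1+1)\prime}\Sigma_{22}\gamma^{(p_1+1)} = 1$;
$\gamma^{(p_1+2)}$ is a vector orthogonal to $\Sigma_{22}(\Gamma_1\ \gamma^{(p_1+1)})$
and normalized so $\gamma^{(p_1+2)\prime}\Sigma_{22}\gamma^{(p_1+2)} = 1$; and so
forth. Let $\Gamma = (\Gamma_1\ \Gamma_2)$; this square matrix is nonsingular since
$\Gamma'\Sigma_{22}\Gamma = I$. Consider the determinant

(38) $\left| \begin{pmatrix} A' & 0 \\ 0 & \Gamma_1' \\ 0 & \Gamma_2' \end{pmatrix} \begin{pmatrix} -\lambda\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\lambda\Sigma_{22} \end{pmatrix} \begin{pmatrix} A & 0 & 0 \\ 0 & \Gamma_1 & \Gamma_2 \end{pmatrix} \right| = \begin{vmatrix} -\lambda I & \Lambda & 0 \\ \Lambda & -\lambda I & 0 \\ 0 & 0 & -\lambda I \end{vmatrix}$
$\quad = (-\lambda)^{p_2-p_1} \begin{vmatrix} -\lambda I & \Lambda \\ \Lambda & -\lambda I \end{vmatrix}$
$\quad = (-\lambda)^{p_2-p_1}\, |-\lambda I| \cdot |-\lambda I - \Lambda(-\lambda I)^{-1}\Lambda|$
$\quad = (-\lambda)^{p_2-p_1}\, |\lambda^2 I - \Lambda^2|$
$\quad = (-\lambda)^{p_2-p_1} \prod_{i=1}^{p_1} (\lambda^2 - \lambda^{(i)2}).$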




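Except for a constant factor the above polynomial is

(39) $\begin{vmatrix} -\lambda\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\lambda\Sigma_{22} \end{vmatrix}.$

Thus the roots of (14) are the roots of (38) set equal to zero, namely,
$\lambda = \pm\lambda^{(i)}$, $i = 1, \ldots, p_1$, and $\lambda = 0$ (of
multiplicity $p_2 - p_1$). Thus
$(\lambda_1, \ldots, \lambda_p) = (\lambda_1, \ldots, \lambda_{p_1}, 0, \ldots, 0, -\lambda_{p_1}, \ldots, -\lambda_1)$.
The set $\{\lambda^{(i)2}\}$, $i = 1, \ldots, p_1$, is the set $\{\lambda_i^2\}$,
$i = 1, \ldots, p_1$. To show that the set $\{\lambda^{(i)}\}$, $i = 1, \ldots, p_1$,
is the set $\{\lambda_i\}$, $i = 1, \ldots, p_1$, we only need to show that
$\lambda^{(i)}$ is nonnegative (and therefore is one of the $\lambda_i$,
$i = 1, \ldots, p_1$). We observe that

(40) $\Sigma_{12}\gamma^{(r)} = -\lambda^{(r)}\Sigma_{11}(-\alpha^{(r)}),$
(41) $\Sigma_{21}(-\alpha^{(r)}) = -\lambda^{(r)}\Sigma_{22}\gamma^{(r)};$

thus, if $\lambda^{(r)}, \alpha^{(r)}, \gamma^{(r)}$ is a solution, so is
$-\lambda^{(r)}, -\alpha^{(r)}, \gamma^{(r)}$. If $\lambda^{(r)}$ were negative,
then $-\lambda^{(r)}$ would be nonnegative and $-\lambda^{(r)} \geq \lambda^{(r)}$.
But since $\lambda^{(r)}$ was to be maximum, we must have
$\lambda^{(r)} \geq -\lambda^{(r)}$ and therefore $\lambda^{(r)} \geq 0$. Since the
set $\{\lambda^{(i)}\}$ is the same as $\{\lambda_i\}$, $i = 1, \ldots, p_1$, we
must have $\lambda^{(i)} = \lambda_i$.

Let

(42) $U = (U_1, \ldots, U_{p_1})' = A'X^{(1)},$
(43) $V^{(1)} = (V_1, \ldots, V_{p_1})' = \Gamma_1'X^{(2)},$
(44) $V^{(2)} = (V_{p_1+1}, \ldots, V_{p_2})' = \Gamma_2'X^{(2)}.$

The components of $U$ are one set of canonical variates, and the components
of $V = (V^{(1)\prime}\ V^{(2)\prime})'$ are the other set. We have

(45) $\mathscr{E}\begin{pmatrix} U \\ V^{(1)} \\ V^{(2)} \end{pmatrix} (U'\ V^{(1)\prime}\ V^{(2)\prime}) = \begin{pmatrix} A' & 0 \\ 0 & \Gamma_1' \\ 0 & \Gamma_2' \end{pmatrix} \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \begin{pmatrix} A & 0 & 0 \\ 0 & \Gamma_1 & \Gamma_2 \end{pmatrix} = \begin{pmatrix} I & \Lambda & 0 \\ \Lambda & I & 0 \\ 0 & 0 & I \end{pmatrix},$

where

(46) $\Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_{p_1} \end{pmatrix}.$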

Definition 12.2.1. Let $X = (X^{(1)\prime}\ X^{(2)\prime})'$, where $X^{(1)}$ has
$p_1$ components and $X^{(2)}$ has $p_2$ $(= p - p_1 \geq p_1)$ components. The
$r$th pair of canonical variates is the pair of linear combinations
$U_r = \alpha^{(r)\prime}X^{(1)}$ and $V_r = \gamma^{(r)\prime}X^{(2)}$, each of unit
variance and uncorrelated with the first $r - 1$ pairs of canonical variates and
having maximum correlation. The correlation is the $r$th canonical correlation.


Theorem 12.2.1. Let $X = (X^{(1)\prime}\ X^{(2)\prime})'$ be a random vector with
covariance matrix $\Sigma$. The $r$th canonical correlation between $X^{(1)}$ and
$X^{(2)}$ is the $r$th largest root of (14). The coefficients of
$\alpha^{(r)\prime}X^{(1)}$ and $\gamma^{(r)\prime}X^{(2)}$ defining the $r$th pair
of canonical variates satisfy (13) for $\lambda = \lambda_r$ and (3) and (4).


We can now verify (without differentiation) that $U_1, V_1$ have maximum
correlation. The linear combinations $a'U = (a'A')X^{(1)}$ and
$b'V = (b'\Gamma')X^{(2)}$ are normalized by $a'a = 1$ and $b'b = 1$. Since $A$
and $\Gamma$ are nonsingular, any vector $\alpha$ can be written as $Aa$ and any
vector $\gamma$ can be written as $\Gamma b$, and hence any linear combinations
$\alpha'X^{(1)}$ and $\gamma'X^{(2)}$ can be written as $a'U$ and $b'V$. The
correlation between them is

(47) $a'(\Lambda\ 0)b = \sum_{i=1}^{p_1} \lambda_i a_i b_i.$

Let $\lambda_i a_i / \sqrt{\sum_{j}(\lambda_j a_j)^2} = c_i$. Then the maximum of

$a'(\Lambda\ 0)b = \sqrt{\sum_{i}(\lambda_i a_i)^2}\ \sum_{i} c_i b_i$

with respect to $b$ is for $b_i = c_i$, since $\sum_i c_i b_i$ is the cosine of
the angle between the vector $b$ and $(c_1, \ldots, c_{p_1}, 0, \ldots, 0)'$.
Then (47) is

$\sqrt{\sum_{i=1}^{p_1} \lambda_i^2 a_i^2} = \sqrt{\sum_{i=2}^{p_1} (\lambda_i^2 - \lambda_1^2) a_i^2 + \lambda_1^2},$

and this is maximized by taking $a_i = 0$, $i = 2, \ldots, p_1$. Thus the maximized
linear combinations are $U_1$ and $V_1$. In verifying that $U_2$ and $V_2$ form
the second pair of canonical variates we note that lack of correlation between
$U_1$ and a linear combination $a'U$ means
$0 = \mathscr{E}U_1 a'U = \mathscr{E}U_1 \sum_i a_i U_i = a_1$, and lack of
correlation between $V_1$ and $b'V$ means $0 = b_1$. The algebra used above
gives the desired result with sums starting with $i = 2$.

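We can derive a single matrix equation for $\alpha$ or $\gamma$. If we multiply
(11) by $\lambda$ and (12) by $\Sigma_{22}^{-1}$, we have

(48) $\lambda\Sigma_{12}\gamma = \lambda^2\Sigma_{11}\alpha,$
(49) $\Sigma_{22}^{-1}\Sigma_{21}\alpha = \lambda\gamma.$

Substitution from (49) into (48) gives

(50) $\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\alpha = \lambda^2\Sigma_{11}\alpha$

or

(51) $(\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \lambda^2\Sigma_{11})\alpha = 0.$

The quantities $\lambda_1^2, \ldots, \lambda_{p_1}^2$ satisfy

(52) $|\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \nu\Sigma_{11}| = 0,$

and $\alpha^{(1)}, \ldots, \alpha^{(p_1)}$ satisfy (51) for
$\lambda^2 = \lambda_1^2, \ldots, \lambda_{p_1}^2$, respectively. The similar
equations for $\gamma^{(1)}, \ldots, \gamma^{(p_2)}$ occur when
$\lambda^2 = \lambda_1^2, \ldots, \lambda_{p_2}^2$ are substituted with

(53) $(\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} - \lambda^2\Sigma_{22})\gamma = 0.$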
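
Equations (49) through (52) translate directly into a small numerical routine.
The following is a minimal sketch, assuming NumPy is available; the function
name `canonical_correlations` and its argument names are illustrative and do
not come from the text. It forms $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$,
takes the square roots of its characteristic roots as the canonical
correlations, normalizes each $\alpha$ as in (3), and recovers $\gamma$ from
(49), which requires the corresponding correlation to be positive.

```python
import numpy as np

def canonical_correlations(sigma, p1):
    """Canonical correlations and coefficient vectors from a partitioned covariance.

    sigma : (p, p) positive definite covariance matrix
    p1    : number of components in the first set (p1 <= p - p1)
    """
    s11, s12 = sigma[:p1, :p1], sigma[:p1, p1:]
    s21, s22 = sigma[p1:, :p1], sigma[p1:, p1:]
    # Matrix whose characteristic roots are lambda_i^2, cf. (50) and (52).
    m = np.linalg.solve(s11, s12 @ np.linalg.solve(s22, s21))
    vals, vecs = np.linalg.eig(m)
    order = np.argsort(-vals.real)
    lam = np.sqrt(np.clip(vals.real[order], 0.0, None))
    alpha = vecs.real[:, order]
    # Normalize each column so that alpha' Sigma11 alpha = 1, as in (3).
    alpha = alpha / np.sqrt(np.sum(alpha * (s11 @ alpha), axis=0))
    # gamma from (49): gamma = Sigma22^{-1} Sigma21 alpha / lambda (lambda > 0 assumed).
    gamma = np.linalg.solve(s22, s21 @ alpha) / lam
    return lam, alpha, gamma
```

With distinct positive roots, the returned matrices should satisfy (33), (34),
and (35) up to rounding error, which gives a convenient check on the output.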


Theorem 12.2.2. The canonical correlations are invariant with respect to
transformations $X^{(i)*} = C_i X^{(i)}$, where $C_i$ is nonsingular, $i = 1, 2$,
and any function of $\Sigma$ that is invariant is a function of the canonical
correlations.


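Proof. Equation (14) is transformed to

(54) $0 = \begin{vmatrix} -\lambda C_1\Sigma_{11}C_1' & C_1\Sigma_{12}C_2' \\ C_2\Sigma_{21}C_1' & -\lambda C_2\Sigma_{22}C_2' \end{vmatrix} = \left| \begin{pmatrix} C_1 & 0 \\ 0 & C_2 \end{pmatrix} \begin{pmatrix} -\lambda\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\lambda\Sigma_{22} \end{pmatrix} \begin{pmatrix} C_1' & 0 \\ 0 & C_2' \end{pmatrix} \right|,$

and hence the roots are unchanged. Conversely, let
$f(\Sigma_{11}, \Sigma_{12}, \Sigma_{22})$ be a vector-valued function of $\Sigma$
such that
$f(C_1\Sigma_{11}C_1', C_1\Sigma_{12}C_2', C_2\Sigma_{22}C_2') = f(\Sigma_{11}, \Sigma_{12}, \Sigma_{22})$
for all nonsingular $C_1$ and $C_2$. If $C_1 = A'$ and $C_2 = \Gamma'$, then
(54) is (38), which depends only on the canonical correlations. Then
$f = f[I, (\Lambda, 0), I]$. ∎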
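
The invariance in Theorem 12.2.2 is easy to check numerically. The sketch
below assumes NumPy; the helper names `canon_corr` and `transformed_blocks`
are hypothetical and introduced only for illustration. It computes the
canonical correlations from the blocks of $\Sigma$ and from the blocks after
the transformation $X^{(i)*} = C_i X^{(i)}$, so that the two results can be
compared.

```python
import numpy as np

def canon_corr(s11, s12, s22):
    """Canonical correlations as square roots of the roots of (52)."""
    m = np.linalg.solve(s11, s12 @ np.linalg.solve(s22, s12.T))
    roots = np.sort(np.linalg.eigvals(m).real)[::-1]
    return np.sqrt(np.clip(roots, 0.0, None))

def transformed_blocks(s11, s12, s22, c1, c2):
    """Covariance blocks of (C1 X1, C2 X2), as they appear in (54)."""
    return c1 @ s11 @ c1.T, c1 @ s12 @ c2.T, c2 @ s22 @ c2.T
```

For any nonsingular `c1` and `c2`, `canon_corr(*transformed_blocks(s11, s12, s22, c1, c2))`
should agree with `canon_corr(s11, s12, s22)` up to rounding error.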


We can make another interpretation of these developments in terms of
prediction. Consider two random variables $U$ and $V$ with means 0 and
variances $\sigma_u^2$ and $\sigma_v^2$ and correlation $\rho$. Consider
approximating $U$ by a multiple of $V$, say $bV$; then the mean squared error of
approximation is

(55) $\mathscr{E}(U - bV)^2 = \sigma_u^2 - 2b\sigma_u\sigma_v\rho + b^2\sigma_v^2 = \sigma_u^2(1 - \rho^2) + (b\sigma_v - \rho\sigma_u)^2.$

This is minimized by taking $b = \sigma_u\rho/\sigma_v$. We can consider $bV$ as
a linear prediction of $U$ from $V$; then $\sigma_u^2(1 - \rho^2)$ is the mean
squared error of prediction. The ratio of the mean squared error of prediction
to the variance of $U$ is $\sigma_u^2(1 - \rho^2)/\sigma_u^2 = 1 - \rho^2$; the
complement is a measure of the relative effect of $V$ on $U$ or the relative
effectiveness of $V$ in predicting $U$. Thus the greater $\rho^2$ or $|\rho|$
is, the more effective is $V$ in predicting $U$.

Now consider the random vector $X$ partitioned according to (1), and
consider using a linear combination $V = \gamma'X^{(2)}$ to predict a linear
combination $U = \alpha'X^{(1)}$. Then $V$ predicts $U$ best if the correlation
between $U$ and $V$ is a maximum. Thus we can say that $\alpha^{(1)\prime}X^{(1)}$
is the linear combination of $X^{(1)}$ that can be predicted best, and
$\gamma^{(1)\prime}X^{(2)}$ is the best predictor [Hotelling (1935)].

The mean squared effect of $V$ on $U$ can be measured as

(56) $\mathscr{E}(bV)^2 = \rho^2\frac{\sigma_u^2}{\sigma_v^2}\,\mathscr{E}V^2 = \rho^2\sigma_u^2,$

and the relative mean squared effect can be measured by the ratio
$\mathscr{E}(bV)^2/\mathscr{E}U^2 = \rho^2$. Thus maximum effect of a linear
combination of $X^{(2)}$ on a linear combination of $X^{(1)}$ is made by
$\gamma^{(1)\prime}X^{(2)}$ on $\alpha^{(1)\prime}X^{(1)}$.

In the special case of $p_1 = 1$, the one canonical correlation is the multiple
correlation between $X^{(1)} = X_1$ and $X^{(2)}$.

The definition of canonical variates and correlations was made in terms of
the covariance matrix $\Sigma = \mathscr{E}(X - \mathscr{E}X)(X - \mathscr{E}X)'$.
We could extend this treatment by starting with a normally distributed vector
$Y$ with $p + p_3$ components and define $X$ as the vector having the conditional
distribution of the first $p$ components of $Y$ given the value of the last
$p_3$ components. This



would mean treating $X_\phi$ with mean $\mathscr{E}X_\phi = \Theta y_\phi^{(3)}$;
the elements of the covariance matrix would be the partial covariances of the
first $p$ elements of $Y$.

The interpretation of canonical variates may be facilitated by considering
the correlations between the canonical variates and the components of the
original vectors [e.g., Darlington, Weinberg, and Walberg (1973)]. The
covariance between the $j$th canonical variate $U_j$ and $X_i$ is

(57) $\mathscr{E}U_jX_i = \mathscr{E}\sum_{k=1}^{p_1} \alpha_k^{(j)}X_kX_i = \sum_{k=1}^{p_1} \alpha_k^{(j)}\sigma_{ki}.$

Since the variance of $U_j$ is 1, the correlation between $U_j$ and $X_i$ is

(58) $\mathrm{Corr}(U_j, X_i) = \frac{\sum_{k=1}^{p_1} \alpha_k^{(j)}\sigma_{ki}}{\sqrt{\sigma_{ii}}}.$

An advantage of this measure is that it does not depend on the units of
measurement of $X_i$. However, it is not a scalar multiple of the weight of
$X_i$ in $U_j$ (namely, $\alpha_i^{(j)}$).

A special case is $\Sigma_{11} = I$, $\Sigma_{22} = I$. Then

(59) $A'A = I, \quad \Gamma'\Gamma = I, \quad A'\Sigma_{12}\Gamma = (\Lambda\ 0).$

From these we obtain

(60) $\Sigma_{12} = A(\Lambda\ 0)\Gamma',$

where $A$ and $\Gamma$ are orthogonal and $\Lambda$ is diagonal. This relationship
is known as the singular value decomposition of $\Sigma_{12}$. The elements of
$\Lambda$ are the square roots of the characteristic roots of
$\Sigma_{12}\Sigma_{12}'$, and the columns of $A$ are characteristic vectors. The
diagonal elements of $\Lambda$ are square roots of the (possibly nonzero) roots
of $\Sigma_{12}'\Sigma_{12}$, and the columns of $\Gamma$ are the characteristic
vectors.


12.3. ESTIMATION OF CANONICAL CORRELATIONS 
AND VARIATES 


12.3.1. Estimation 


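Let $x_1, \ldots, x_N$ be $N$ observations from $N(\mu, \Sigma)$. Let $x_\alpha$
be partitioned into two subvectors of $p_1$ and $p_2$ components, respectively,

(1) $x_\alpha = \begin{pmatrix} x_\alpha^{(1)} \\ x_\alpha^{(2)} \end{pmatrix}, \quad \alpha = 1, \ldots, N.$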




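The maximum likelihood estimator of $\Sigma$ [partitioned as in (2) of Section
12.2] is

(2) $\hat\Sigma = \begin{pmatrix} \hat\Sigma_{11} & \hat\Sigma_{12} \\ \hat\Sigma_{21} & \hat\Sigma_{22} \end{pmatrix} = \frac{1}{N}\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'$
$\quad = \frac{1}{N}\begin{pmatrix} \sum_\alpha (x_\alpha^{(1)} - \bar{x}^{(1)})(x_\alpha^{(1)} - \bar{x}^{(1)})' & \sum_\alpha (x_\alpha^{(1)} - \bar{x}^{(1)})(x_\alpha^{(2)} - \bar{x}^{(2)})' \\ \sum_\alpha (x_\alpha^{(2)} - \bar{x}^{(2)})(x_\alpha^{(1)} - \bar{x}^{(1)})' & \sum_\alpha (x_\alpha^{(2)} - \bar{x}^{(2)})(x_\alpha^{(2)} - \bar{x}^{(2)})' \end{pmatrix}.$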


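The maximum likelihood estimators of the canonical correlations $\Lambda$ and
the canonical variates defined by $A$ and $\Gamma$ involve applying the algebra
of the previous section to $\hat\Sigma$. The matrices $\Lambda$, $A$, and
$\Gamma_1$ are uniquely defined if we assume the canonical correlations
different and that the first nonzero element of each column of $A$ is positive.
The indeterminacy in $\Gamma_2$ allows multiplication on the right by a
$(p_2 - p_1) \times (p_2 - p_1)$ orthogonal matrix; this indeterminacy can be
removed by various types of requirements, for example, that the submatrix
formed by the lower $p_2 - p_1$ rows be upper or lower triangular with positive
diagonal elements. Application of Corollary 3.2.1 then shows that the maximum
likelihood estimators of $\lambda_1, \ldots, \lambda_{p_1}$ are the roots of

(3) $\begin{vmatrix} -l\hat\Sigma_{11} & \hat\Sigma_{12} \\ \hat\Sigma_{21} & -l\hat\Sigma_{22} \end{vmatrix} = 0,$

and the $j$th columns of $\hat{A}$ and $\hat\Gamma_1$ satisfy

(4) $\begin{pmatrix} -l_j\hat\Sigma_{11} & \hat\Sigma_{12} \\ \hat\Sigma_{21} & -l_j\hat\Sigma_{22} \end{pmatrix} \begin{pmatrix} \hat\alpha^{(j)} \\ \hat\gamma^{(j)} \end{pmatrix} = 0,$

(5) $\hat\alpha^{(j)\prime}\hat\Sigma_{11}\hat\alpha^{(j)} = 1, \quad \hat\gamma^{(j)\prime}\hat\Sigma_{22}\hat\gamma^{(j)} = 1.$

$\hat\Gamma_2$ satisfies

(6) $\hat\Gamma_2'\hat\Sigma_{22}\hat\Gamma_1 = 0,$
(7) $\hat\Gamma_2'\hat\Sigma_{22}\hat\Gamma_2 = I.$

When the other restrictions on $\hat\Gamma_2$ are made, $\hat{A}$, $\hat\Gamma$,
and $\hat\Lambda$ are uniquely defined.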




Theorem 12.3.1. Let $x_1, \ldots, x_N$ be $N$ observations from $N(\mu, \Sigma)$.
Let $\Sigma$ be partitioned into $p_1$ and $p_2$ $(p_1 \leq p_2)$ rows and columns
as in (2) in Section 12.2, and let $x_\alpha$ be similarly partitioned as in (1).
The maximum likelihood estimators of the canonical correlations are the roots
of (3), where $\hat\Sigma_{ij}$ are defined by (2). The maximum likelihood
estimators of the coefficients of the $j$th canonical components satisfy (4)
and (5), $j = 1, \ldots, p_1$; the remaining components satisfy (6) and (7).


In the population the canonical correlations and canonical variates were
found in terms of maximizing correlations of linear combinations of two sets
of variates. The entire argument can be carried out in terms of the sample.
Thus $\hat\alpha^{(1)\prime}x_\alpha^{(1)}$ and $\hat\gamma^{(1)\prime}x_\alpha^{(2)}$
have maximum sample correlation between any linear combinations of
$x_\alpha^{(1)}$ and $x_\alpha^{(2)}$, and this correlation is $l_1$. Similarly,
$\hat\alpha^{(2)\prime}x_\alpha^{(1)}$ and $\hat\gamma^{(2)\prime}x_\alpha^{(2)}$
have the second maximum sample correlation, and so forth.

It may also be observed that we could define the sample canonical variates
and correlations in terms of $S$, the unbiased estimator of $\Sigma$. Then
$a^{(j)} = \sqrt{(N-1)/N}\,\hat\alpha^{(j)}$, $c^{(j)} = \sqrt{(N-1)/N}\,\hat\gamma^{(j)}$,
and $l_j$ satisfy

(8) $S_{12}c^{(j)} = l_jS_{11}a^{(j)},$
(9) $S_{21}a^{(j)} = l_jS_{22}c^{(j)},$
(10) $a^{(j)\prime}S_{11}a^{(j)} = 1, \quad c^{(j)\prime}S_{22}c^{(j)} = 1.$

We shall call the linear combinations $a^{(j)\prime}x_\alpha^{(1)}$ and
$c^{(j)\prime}x_\alpha^{(2)}$ the sample canonical variates.

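We can also derive the sample canonical variates from the sample correlation
matrix,

(11) $R = \left(\frac{s_{ij}}{\sqrt{s_{ii}}\sqrt{s_{jj}}}\right) = (r_{ij}) = \begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix}.$

Let

(12) $S_1 = \begin{pmatrix} \sqrt{s_{11}} & 0 & \cdots & 0 \\ 0 & \sqrt{s_{22}} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sqrt{s_{p_1p_1}} \end{pmatrix},$

(13) $S_2 = \begin{pmatrix} \sqrt{s_{p_1+1,p_1+1}} & 0 & \cdots & 0 \\ 0 & \sqrt{s_{p_1+2,p_1+2}} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sqrt{s_{pp}} \end{pmatrix}.$

Then we can write (8) through (10) as

(14) $R_{12}(S_2c^{(j)}) = l_jR_{11}(S_1a^{(j)}),$
(15) $R_{21}(S_1a^{(j)}) = l_jR_{22}(S_2c^{(j)}),$
(16) $(S_1a^{(j)})'R_{11}(S_1a^{(j)}) = 1, \quad (S_2c^{(j)})'R_{22}(S_2c^{(j)}) = 1.$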


We can give these developments a geometric interpretation. The rows of
the matrix $(x_1, \ldots, x_N)$ can be interpreted as $p$ vectors in an
$N$-dimensional space, and the rows of $(x_1 - \bar{x}, \ldots, x_N - \bar{x})$
are the $p$ vectors projected on the $(N - 1)$-dimensional subspace orthogonal
to the equiangular line. Denote these as $x_1^*, \ldots, x_p^*$. Any vector $u^*$
with components
$\alpha'(x_1^{(1)} - \bar{x}^{(1)}, \ldots, x_N^{(1)} - \bar{x}^{(1)}) = \alpha_1x_1^* + \cdots + \alpha_{p_1}x_{p_1}^*$
is in the $p_1$-space spanned by $x_1^*, \ldots, x_{p_1}^*$, and a vector $v^*$
with components
$\gamma'(x_1^{(2)} - \bar{x}^{(2)}, \ldots, x_N^{(2)} - \bar{x}^{(2)}) = \gamma_1x_{p_1+1}^* + \cdots + \gamma_{p_2}x_p^*$
is in the $p_2$-space spanned by $x_{p_1+1}^*, \ldots, x_p^*$. The cosine of the
angle between these two vectors is the correlation between
$u_\alpha = \alpha'x_\alpha^{(1)}$ and $v_\alpha = \gamma'x_\alpha^{(2)}$,
$\alpha = 1, \ldots, N$. Finding $\alpha$ and $\gamma$ to maximize the correlation
is equivalent to finding the vectors in the $p_1$-space and the $p_2$-space such
that the angle between them is least (i.e., has the greatest cosine). This gives
the first canonical variates, and the first canonical correlation is the cosine
of the angle. Similarly, the second canonical variates correspond to vectors
orthogonal to the first canonical variates and with the angle minimized.
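
The identification of the sample correlation with the cosine of an angle can
be illustrated with a few lines of code. This is a sketch assuming NumPy, with
data matrices stored one observation per row (the transpose of the layout in
the text); the function name `cosine_between` is illustrative.

```python
import numpy as np

def cosine_between(alpha, gamma, x1, x2):
    """Cosine of the angle between u* and v* for given coefficient vectors.

    x1 : (N, p1) array of first-set observations, one per row
    x2 : (N, p2) array of second-set observations, one per row
    """
    u = (x1 - x1.mean(axis=0)) @ alpha  # components of u*
    v = (x2 - x2.mean(axis=0)) @ gamma  # components of v*
    return float(u @ v / np.sqrt((u @ u) * (v @ v)))
```

The value returned should coincide with the ordinary sample correlation of the
two linear combinations, for example as computed by `np.corrcoef`.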


12.3.2. Computation 


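We shall discuss briefly computation in terms of the population quantities.
Equations (50), (51), or (52) of Section 12.2 can be used. The computation of
$\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ can be accomplished by solving
$\Sigma_{21} = \Sigma_{22}F$ for $\Sigma_{22}^{-1}\Sigma_{21}$ and then multiplying
by $\Sigma_{12}$. If $p_1$ is sufficiently small, the determinant
$|\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \nu\Sigma_{11}|$ can be expanded into a
polynomial in $\nu$, and the polynomial equation may be solved for $\nu$. The
solutions are then inserted into (51) to arrive at the vectors $\alpha$.

In many cases $p_1$ is too large for this procedure to be efficient. Then one
can use an iterative procedure

(17) $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\alpha(i) = \lambda^2(i+1)\,\alpha(i+1),$

starting with an initial approximation $\alpha(0)$; the vector $\alpha(i+1)$ may
be normalized by

(18) $\alpha(i+1)'\Sigma_{11}\alpha(i+1) = 1.$

The $\lambda^2(i+1)$ converges to $\lambda_1^2$ and $\alpha(i+1)$ converges to
$\alpha^{(1)}$ (if $\lambda_1 > \lambda_2$). This can be demonstrated in a fashion
similar to that used for principal components, using

(19) $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = A\Lambda^2A^{-1}$

from (45) of Section 12.2. See Problem 12.9.

The right-hand side of (19) is $\sum_{i=1}^{p_1} \alpha^{(i)}\lambda_i^2\tilde\alpha^{(i)\prime}$,
where $\tilde\alpha^{(i)\prime}$ is the $i$th row of $A^{-1}$. From the fact that
$A'\Sigma_{11}A = I$, we find that $A'\Sigma_{11} = A^{-1}$ and thus
$\alpha^{(i)\prime}\Sigma_{11} = \tilde\alpha^{(i)\prime}$. Now

(20) $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \lambda_1^2\alpha^{(1)}\tilde\alpha^{(1)\prime} = \sum_{i=2}^{p_1} \alpha^{(i)}\lambda_i^2\tilde\alpha^{(i)\prime} = A\begin{pmatrix} 0 & 0 & \cdots & 0 \\ 0 & \lambda_2^2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_{p_1}^2 \end{pmatrix}A^{-1}.$

The maximum characteristic root of this matrix is $\lambda_2^2$. If we now use
this matrix for iteration, we will obtain $\lambda_2^2$ and $\alpha^{(2)}$. The
procedure is continued to find as many $\lambda_i^2$ and $\alpha^{(i)}$ as desired.

Given $\lambda_i$ and $\alpha^{(i)}$, we find $\gamma^{(i)}$ from
$\Sigma_{21}\alpha^{(i)} = \lambda_i\Sigma_{22}\gamma^{(i)}$ or equivalently
$(1/\lambda_i)\Sigma_{22}^{-1}\Sigma_{21}\alpha^{(i)} = \gamma^{(i)}$. A check on
the computations is provided by comparing $\Sigma_{12}\gamma^{(i)}$ and
$\lambda_i\Sigma_{11}\alpha^{(i)}$.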
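
The iteration (17) and (18), followed by the deflation step (20), can be
sketched as follows. This assumes NumPy; the function name
`power_iteration_cca`, the fixed iteration count, and the starting vector of
ones are illustrative choices made here, and convergence is only to be
expected when the roots being sought are distinct.

```python
import numpy as np

def power_iteration_cca(s11, s12, s22, n_roots=1, iters=500):
    """Largest canonical correlations and alpha vectors by iteration (17)-(20)."""
    p1 = s11.shape[0]
    # Matrix on the left of (17): S11^{-1} S12 S22^{-1} S21.
    m = np.linalg.solve(s11, s12 @ np.linalg.solve(s22, s12.T))
    lams, alphas = [], []
    for _root in range(n_roots):
        a = np.ones(p1)
        a /= np.sqrt(a @ s11 @ a)
        lam2 = 0.0
        for _step in range(iters):
            b = m @ a
            lam2 = np.sqrt(b @ s11 @ b)  # lambda^2(i+1), fixed by normalization (18)
            a = b / lam2
        lams.append(np.sqrt(lam2))
        alphas.append(a)
        # Deflation as in (20): the i-th row of A^{-1} is alpha' S11.
        m = m - lam2 * np.outer(a, a @ s11)
    return np.array(lams), np.column_stack(alphas)
```

A production routine would normally stop on a convergence tolerance rather
than after a fixed number of steps; the fixed count keeps the sketch short.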

For the sample we perform these calculations with $\hat\Sigma_{ij}$ or $S_{ij}$
substituted for $\Sigma_{ij}$. It is often convenient to use $R_{ij}$ in the
computation (because $-1 < r_{ij} < 1$) to obtain $S_1a^{(j)}$ and $S_2c^{(j)}$;
from these $a^{(j)}$ and $c^{(j)}$ can be computed.

Modern computational procedures are available for canonical correlations
and variates similar to those sketched for principal components. Let

(21) $Z_1 = (x_1^{(1)} - \bar{x}^{(1)}, \ldots, x_N^{(1)} - \bar{x}^{(1)}),$
(22) $Z_2 = (x_1^{(2)} - \bar{x}^{(2)}, \ldots, x_N^{(2)} - \bar{x}^{(2)}).$




The QR decomposition of the transpose of these matrices (Section 11.4) is
$Z_i' = Q_iR_i$, where $Q_i'Q_i = I_{p_i}$ and $R_i$ is upper triangular. Then
$S_{ij} = Z_iZ_j' = R_i'Q_i'Q_jR_j$, $i, j = 1, 2$, and $S_{ii} = R_i'R_i$,
$i = 1, 2$. The canonical correlations are the singular values of $Q_1'Q_2$ and
the square roots of the characteristic roots of $(Q_1'Q_2)(Q_1'Q_2)'$ (by
Theorem 12.2.2). Then the singular value decomposition of $Q_1'Q_2$ is
$P(L\ 0)T$, where $P$ and $T$ are orthogonal and $L$ is diagonal. To effect the
decomposition Householder transformations are applied to the left and right of
$Q_1'Q_2$ to obtain an upper bidiagonal matrix, that is, a matrix with entries
on the main diagonal and first superdiagonal. Givens matrices are used to
reduce this matrix to a matrix that is diagonal to the degree of approximation
required. For more detail see Kennedy and Gentle (1980), Sections 7.2 and 12.2,
Chambers (1977), Björck and Golub (1973), Golub and Luk (1976), and Golub and
Van Loan (1989).
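
The QR-based procedure can be sketched in a few lines when a library supplies
the QR and singular value decompositions. The code below assumes NumPy, whose
`numpy.linalg.qr` returns the reduced factorization by default, and stores the
data one observation per row so that the centered arrays are $Z_1'$ and $Z_2'$;
the function name `cca_qr` is illustrative.

```python
import numpy as np

def cca_qr(x1, x2):
    """Sample canonical correlations as singular values of Q1' Q2.

    x1 : (N, p1) array, x2 : (N, p2) array, one observation per row.
    """
    z1 = x1 - x1.mean(axis=0)  # Z_1' of (21)
    z2 = x2 - x2.mean(axis=0)  # Z_2' of (22)
    q1, _r1 = np.linalg.qr(z1)  # Z_1' = Q_1 R_1 with Q_1' Q_1 = I
    q2, _r2 = np.linalg.qr(z2)
    return np.linalg.svd(q1.T @ q2, compute_uv=False)
```

Because the sample covariance matrices are never formed or inverted, this
route tends to be better conditioned than working from (52) directly.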


12.4. STATISTICAL INFERENCE 


12.4.1. Tests of Independence and of Rank 


In Chapter 9 we considered testing the null hypothesis that $X^{(1)}$ and
$X^{(2)}$ are independent, which is equivalent to the null hypothesis that
$\Sigma_{12} = 0$. Since $A'\Sigma_{12}\Gamma = (\Lambda\ 0)$, it is seen that the
hypothesis is equivalent to $\Lambda = 0$, that is,
$\rho_1 = \cdots = \rho_{p_1} = 0$. The likelihood ratio criterion for testing
this null hypothesis is the $N/2$ power of

(1) $\prod_{i=1}^{p_1} (1 - r_i^2),$

where $r_1 = l_1 \geq \cdots \geq r_{p_1} = l_{p_1} \geq 0$ are the $p_1$ possibly
nonzero sample canonical correlations. Under the null hypothesis, the limiting
distribution of Bartlett's modification of $-2$ times the logarithm of the
likelihood ratio criterion, namely,

(2) $-[N - \tfrac{1}{2}(p + 3)] \sum_{i=1}^{p_1} \log(1 - r_i^2),$

is $\chi^2$ with $p_1p_2$ degrees of freedom. (See Section 9.4.) Note that it is
approximately

(3) $N\sum_{i=1}^{p_1} r_i^2 = N\,\mathrm{tr}\,A_{11}^{-1}A_{12}A_{22}^{-1}A_{21},$

which is $N$ times Nagao's criterion [(2) of Section 9.5].

If $\Sigma_{12} \neq 0$, an interesting question is how many population canonical
correlations are different from 0; that is, how many canonical variates are
needed to explain the correlations between $X^{(1)}$ and $X^{(2)}$? The number
of nonzero canonical correlations is equal to the rank of $\Sigma_{12}$. The
likelihood ratio criterion for testing the null hypothesis
$H_k: \rho_{k+1} = \cdots = \rho_{p_1} = 0$, that is, that the rank of
$\Sigma_{12}$ is not greater than $k$, is $\prod_{i=k+1}^{p_1} (1 - r_i^2)^{N/2}$
[Fujikoshi (1974)]. Under the null hypothesis

(4) $-[N - \tfrac{1}{2}(p + 3)] \sum_{i=k+1}^{p_1} \log(1 - r_i^2)$

has approximately the $\chi^2$-distribution with $(p_1 - k)(p_2 - k)$ degrees of
freedom. [Glynn and Muirhead (1978) suggest multiplying the sum in (4) by
$N - k - \tfrac{1}{2}(p + 3) + \sum_{i=1}^{k} (1/r_i^2)$; see also Lawley (1959).]
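
The statistics (2) and (4) differ only in where the sum starts, so one small
function covers both. The sketch below uses only the Python standard library;
the function name `bartlett_statistic` and its argument names are illustrative.
It returns the statistic and its degrees of freedom, leaving the comparison
with a $\chi^2$ table or a library quantile function to the caller.

```python
import math

def bartlett_statistic(r, n_obs, p2, k=0):
    """Statistic (4) and its degrees of freedom; k = 0 gives (2).

    r     : sample canonical correlations r_1 >= ... >= r_{p1}
    n_obs : sample size N
    p2    : number of components in the second set
    k     : hypothesized maximum rank of Sigma_12
    """
    p1 = len(r)
    p = p1 + p2
    factor = n_obs - 0.5 * (p + 3)
    stat = -factor * sum(math.log(1.0 - ri * ri) for ri in r[k:])
    df = (p1 - k) * (p2 - k)
    return stat, df
```

Calling the function with $k = 0, 1, \ldots$ in turn gives the sequence of
tests described in the text; as noted there, these tests are not independent.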

To determine the numbers of nonzero and zero population canonical
correlations one can test that all the roots are 0; if that hypothesis is rejected,
test that the $p_1 - 1$ smallest roots are 0; etc. Of course, these procedures
are not statistically independent, even asymptotically. Alternatively, one
could use a sequence of tests in the opposite direction: Test $\rho_{p_1} = 0$,
then $\rho_{p_1-1} = \rho_{p_1} = 0$, and so on, until a hypothesis is rejected or
until $\Sigma_{12} = 0$ is accepted. Yet another procedure (which can only be
carried out for small $p_1$) is to test $\rho_{p_1} = 0$, then $\rho_{p_1-1} = 0$,
and so forth. In this procedure one would use $r_j$ to test the hypothesis
$\rho_j = 0$. The relevant asymptotic distribution will be discussed in
Section 12.4.2.


12.4.2. Distributions of Canonical Correlations


The density of the canonical correlations is given in Section 13.4 for the case
that $\Sigma_{12} = 0$, that is, all the population correlations are 0. The
density when some population correlations are different from 0 has been given
by Constantine (1963) in terms of a hypergeometric function of two matrix
arguments.

The large-sample theory is more manageable. Suppose the first $k$ canonical
correlations are positive, less than 1, and different, and suppose that
$p_1 - k$ correlations are 0. Let

(5) $z_i = \sqrt{N}\,\frac{r_i^2 - \rho_i^2}{2\rho_i(1 - \rho_i^2)}, \quad i = 1, \ldots, k,$
$\quad\ z_i = Nr_i^2, \quad i = k+1, \ldots, p_1.$

Then in the limiting distribution $z_1, \ldots, z_k$ and the set
$z_{k+1}, \ldots, z_{p_1}$ are mutually independent, $z_i$ has the limiting
distribution $N(0, 1)$, $i = 1, \ldots, k$, and the density of the limiting
distribution of $z_{k+1}, \ldots, z_{p_1}$ is

(6) $\frac{\pi^{\frac{1}{2}(p_1-k)^2}}{2^{\frac{1}{2}(p_1-k)(p_2-k)}\,\Gamma_{p_1-k}[\frac{1}{2}(p_1-k)]\,\Gamma_{p_1-k}[\frac{1}{2}(p_2-k)]} \exp\left(-\tfrac{1}{2}\sum_{i=k+1}^{p_1} z_i\right) \prod_{i=k+1}^{p_1} z_i^{\frac{1}{2}(p_2-p_1-1)} \prod_{\substack{i<j \\ i,j=k+1}}^{p_1} (z_i - z_j).$

This is the density (11) of Section 13.3 of the characteristic roots of a
$(p_1 - k)$-order matrix with distribution $W(I_{p_1-k}, p_2 - k)$. Note that the
normalizing factor for the squared correlations corresponding to nonzero
population correlations is $\sqrt{N}$, while the factor corresponding to zero
population correlations is $N$. See Chapter 13.

In large samples we treat $r_i^2$ as $N[\rho_i^2, (1/N)4\rho_i^2(1 - \rho_i^2)^2]$
or $r_i$ as $N[\rho_i, (1/N)(1 - \rho_i^2)^2]$ (by Theorem 4.2.3) to obtain tests
of $\rho_i$ or confidence intervals for $\rho_i$. Lawley (1959) has shown that the
transformation $z_i = \tanh^{-1}(r_i)$ [see Section 4.2.3] does not stabilize the
variance and has a significant bias in estimating $\zeta_i = \tanh^{-1}(\rho_i)$.
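
The large-sample normal approximation for $r_i$ leads to a simple interval
for $\rho_i$. The sketch below uses only the standard library; the function
name `canonical_corr_interval` is illustrative, and replacing $\rho_i$ by $r_i$
in the variance $(1/N)(1 - \rho_i^2)^2$ is a plug-in assumption made here, not
a step taken in the text. The approximation applies only to a positive
population correlation that differs from the others.

```python
import math

def canonical_corr_interval(r, n_obs, z=1.96):
    """Approximate interval for rho_i from r_i ~ N[rho_i, (1 - rho_i^2)^2 / N].

    The unknown rho_i in the variance is replaced by r (plug-in assumption).
    z is the standard normal quantile; 1.96 gives roughly 95% coverage.
    """
    se = (1.0 - r * r) / math.sqrt(n_obs)
    lower = max(r - z * se, 0.0)  # canonical correlations are nonnegative
    upper = min(r + z * se, 1.0)
    return lower, upper
```

For small samples the interval should be read with caution, since the
approximation is asymptotic.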


12.5. AN EXAMPLE 


In this section we consider a simple illustrative example. Rao [(1952), p. 245]
gives some measurements on the first and second adult sons in a sample of 25
families. (These have been used in Problem 3.1 and Problem 4.41.) Let
$x_{1\alpha}$ be the head length of the first son in the $\alpha$th family,
$x_{2\alpha}$ be the head breadth of the first son, $x_{3\alpha}$ be the head
length of the second son, and $x_{4\alpha}$ be the head breadth of the second
son. We shall investigate the relations between the measurements for the first
son and for the second. Thus $x_\alpha^{(1)\prime} = (x_{1\alpha}, x_{2\alpha})$
and $x_\alpha^{(2)\prime} = (x_{3\alpha}, x_{4\alpha})$. The data can be
summarized as†

(1) $\bar{x}' = (185.72, 151.12, 183.84, 149.24),$

$\quad\ S = \begin{pmatrix} 95.2933 & 52.8683 & 69.6617 & 46.1117 \\ 52.8683 & 54.3600 & 51.3117 & 35.0533 \\ 69.6617 & 51.3117 & 100.8067 & 56.5400 \\ 46.1117 & 35.0533 & 56.5400 & 45.0233 \end{pmatrix}.$

The matrix of correlations is

(2) $R = \begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix} = \begin{pmatrix} 1.0000 & 0.7346 & 0.7108 & 0.7040 \\ 0.7346 & 1.0000 & 0.6932 & 0.7086 \\ 0.7108 & 0.6932 & 1.0000 & 0.8392 \\ 0.7040 & 0.7086 & 0.8392 & 1.0000 \end{pmatrix}.$

All of the correlations are about 0.7 except for the correlation between the
two measurements on second sons. In particular, $R_{12}$ is nearly of rank one,
and hence the second canonical correlation will be near zero.

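We compute

(3) $R_{22}^{-1}R_{21} = \begin{pmatrix} 0.405769 & 0.333205 \\ 0.363480 & 0.428976 \end{pmatrix},$

(4) $R_{12}R_{22}^{-1}R_{21} = \begin{pmatrix} 0.544311 & 0.538841 \\ 0.538841 & 0.534950 \end{pmatrix}.$

The determinantal equation is

(5) $0 = \begin{vmatrix} 0.544311 - 1.0000\nu & 0.538841 - 0.7346\nu \\ 0.538841 - 0.7346\nu & 0.534950 - 1.0000\nu \end{vmatrix} = 0.460363\nu^2 - 0.287596\nu + 0.000830.$

The roots are 0.621816 and 0.002900; thus $l_1 = 0.788553$ and $l_2 = 0.053852$.
Corresponding to these roots are the vectors

(6) $S_1a^{(1)} = \begin{pmatrix} 0.552166 \\ 0.521548 \end{pmatrix}, \quad S_1a^{(2)} = \begin{pmatrix} 1.366501 \\ -1.378467 \end{pmatrix},$

where

(7) $S_1 = \begin{pmatrix} \sqrt{s_{11}} & 0 \\ 0 & \sqrt{s_{22}} \end{pmatrix} = \begin{pmatrix} 9.7618 & 0 \\ 0 & 7.3729 \end{pmatrix}.$


† Rao's computations are in error; his last "difference" is incorrect.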




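We apply $(1/l_i)R_{22}^{-1}R_{21}$ to $S_1a^{(i)}$ to obtain

(8) $S_2c^{(1)} = \begin{pmatrix} 0.504511 \\ 0.538242 \end{pmatrix}, \quad S_2c^{(2)} = \begin{pmatrix} 1.767281 \\ -1.757288 \end{pmatrix},$

where

(9) $S_2 = \begin{pmatrix} \sqrt{s_{33}} & 0 \\ 0 & \sqrt{s_{44}} \end{pmatrix} = \begin{pmatrix} 10.0402 & 0 \\ 0 & 6.7099 \end{pmatrix}.$

We check these computations by calculating

(10) $\frac{1}{l_1}R_{11}^{-1}R_{12}(S_2c^{(1)}) = \begin{pmatrix} 0.552157 \\ 0.521560 \end{pmatrix}, \quad \frac{1}{l_2}R_{11}^{-1}R_{12}(S_2c^{(2)}) = \begin{pmatrix} 1.365151 \\ -1.376741 \end{pmatrix}.$

The first vector in (10) corresponds closely to the first vector in (6); in fact, it
is a slight improvement, for the computation is equivalent to an iteration on
$S_1a^{(1)}$. The second vector in (10) does not correspond as closely to the
second vector in (6). One reason is that $l_2$ is correct to only four or five
significant figures (as is $\nu_2 = l_2^2$) and thus the components of
$S_2c^{(2)}$ can be correct to only as many significant figures; secondly, the
fact that $S_2c^{(2)}$ corresponds to the smaller root means that the iteration
decreases the accuracy instead of increasing it. Our final results are

(11) $l_1 = 0.789, \quad l_2 = 0.054,$
$\quad\ a^{(1)} = \begin{pmatrix} 0.0566 \\ 0.0707 \end{pmatrix}, \quad a^{(2)} = \begin{pmatrix} 0.1400 \\ -0.1870 \end{pmatrix},$
$\quad\ c^{(1)} = \begin{pmatrix} 0.0502 \\ 0.0802 \end{pmatrix}, \quad c^{(2)} = \begin{pmatrix} 0.1760 \\ -0.2619 \end{pmatrix}.$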


The larger of the two canonical correlations, 0.789, is larger than any of 
the individual correlations of a variable of the first set with a variable of the 
other. The second canonical correlation is very near zero. This means that to 
study the relation between two head dimensions of first sons and second sons 
we can confine our attention to the first canonical variates; the second 
canonical variates are correlated only slightly. The first canonical variate in 
each set is approximately proportional to the sum of the two measurements 
divided by their respective standard deviations; the second canonical variate 
in each set is approximately proportional to the difference of the two 
standardized measurements. 
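
The hand computation in this section can be repeated numerically from the
covariance matrix $S$ given above. The following sketch assumes NumPy and
works from $S$ directly rather than from the rounded correlation matrix, so
small differences in the later decimal places are to be expected; the variable
names are illustrative.

```python
import numpy as np

# Sample covariance matrix S of the head measurements, as given in the text.
s = np.array([
    [95.2933, 52.8683, 69.6617, 46.1117],
    [52.8683, 54.3600, 51.3117, 35.0533],
    [69.6617, 51.3117, 100.8067, 56.5400],
    [46.1117, 35.0533, 56.5400, 45.0233],
])
s11, s12, s21, s22 = s[:2, :2], s[:2, 2:], s[2:, :2], s[2:, 2:]

# Characteristic roots nu = l^2 of S11^{-1} S12 S22^{-1} S21.
m = np.linalg.solve(s11, s12 @ np.linalg.solve(s22, s21))
nu, vecs = np.linalg.eig(m)
order = np.argsort(-nu.real)
nu, vecs = nu.real[order], vecs.real[:, order]
l = np.sqrt(nu)

# Normalize so a' S11 a = 1 and make the first element of each column positive.
a = vecs / np.sqrt(np.sum(vecs * (s11 @ vecs), axis=0))
a = a * np.sign(a[0])
# Coefficients for the second set from S21 a = l S22 c.
c = np.linalg.solve(s22, s21 @ a) / l

print(np.round(l, 3))
print(np.round(a, 4))
print(np.round(c, 4))
```

The printed values should be close to the final results (11); the second pair
is the less accurate one, for the reasons discussed in the text.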


12.6. LINEARLY RELATED EXPECTED VALUES 


12.6.1. Canonical Analysis of Regression Matrices


In this section we develop canonical correlations and variates for one
stochastic vector and one nonstochastic vector. The expected value of the
stochastic vector is a linear function of the nonstochastic vector (Chapter 8).
We find new coordinate systems so that the expected value of each coordinate
of the stochastic vector depends on only one coordinate of the nonstochastic
vector; the coordinates of the stochastic vector are uncorrelated in
the stochastic sense, and the coordinates of the nonstochastic vector are
uncorrelated in the sample. The coordinates are ordered according to the
effect sum of squares relative to the variance. The algebra is similar to that
developed in Section 12.2.

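If $X$ has the normal distribution $N(\mu, \Sigma)$ with $X$, $\mu$, and $\Sigma$
partitioned as in (1) and (2) of Section 12.2 and
$\mu = (\mu^{(1)\prime}, \mu^{(2)\prime})'$, the conditional distribution of
$X^{(1)}$ given $X^{(2)} = x^{(2)}$ is normal with mean

(1) $\mu^{(1)} + \mathrm{B}(x^{(2)} - \mu^{(2)}), \quad \mathrm{B} = \Sigma_{12}\Sigma_{22}^{-1},$

and covariance matrix

(2) $\Sigma_{11\cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$

Since we consider a set of random vectors $X_1^{(1)}, \ldots, X_N^{(1)}$ with
expected values depending on $x_1^{(2)}, \ldots, x_N^{(2)}$ (nonstochastic), we
can write the conditional expected value of $X_\phi^{(1)}$ as
$\tau + \mathrm{B}(x_\phi^{(2)} - \bar{x}^{(2)})$, where
$\tau = \mu^{(1)} + \mathrm{B}(\bar{x}^{(2)} - \mu^{(2)})$ can be considered as a
parameter vector. This is the model of Section 8.2 with a slight change of
notation.

The model of this section is

(3) $\mathscr{E}X_\phi^{(1)} = \tau + \mathrm{B}(x_\phi^{(2)} - \bar{x}^{(2)}), \quad \phi = 1, \ldots, N,$

where $x_1^{(2)}, \ldots, x_N^{(2)}$ are a set of nonstochastic vectors
$(q \times 1)$ and $\bar{x}^{(2)} = N^{-1}\sum_{\phi=1}^{N} x_\phi^{(2)}$. The
covariance matrix is

(4) $\mathscr{E}(X_\phi^{(1)} - \mathscr{E}X_\phi^{(1)})(X_\phi^{(1)} - \mathscr{E}X_\phi^{(1)})' = \Psi.$

Consider a linear combination of $X_\phi^{(1)}$, say $U_\phi = \alpha'X_\phi^{(1)}$.
Then $U_\phi$ has variance $\alpha'\Psi\alpha$ and expected value

(5) $\mathscr{E}U_\phi = \alpha'\tau + \alpha'\mathrm{B}(x_\phi^{(2)} - \bar{x}^{(2)}).$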




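The mean expected value is $(1/N)\sum_{\phi=1}^{N} \mathscr{E}U_\phi = \alpha'\tau$,
and the mean sum of squares due to $x^{(2)}$ is

(6) $\frac{1}{N}\sum_{\phi=1}^{N} (\mathscr{E}U_\phi - \alpha'\tau)^2 = \frac{1}{N}\sum_{\phi=1}^{N} \alpha'\mathrm{B}(x_\phi^{(2)} - \bar{x}^{(2)})(x_\phi^{(2)} - \bar{x}^{(2)})'\mathrm{B}'\alpha = \alpha'\mathrm{B}S_{22}\mathrm{B}'\alpha.$

We can ask for the linear combination that maximizes the mean sum of
squares relative to its variance; that is, the linear combination of dependent
variables on which the independent variables have greatest effect. We want
to maximize (6) subject to $\alpha'\Psi\alpha = 1$. That leads to the vector
equation

(7) $(\mathrm{B}S_{22}\mathrm{B}' - \kappa\Psi)\alpha = 0$

for $\kappa$ satisfying

(8) $|\mathrm{B}S_{22}\mathrm{B}' - \kappa\Psi| = 0.$

Multiplication of (7) on the left by $\alpha'$ shows that
$\alpha'\mathrm{B}S_{22}\mathrm{B}'\alpha = \kappa$ for $\alpha$ and $\kappa$
satisfying $\alpha'\Psi\alpha = 1$ and (7); to obtain the maximum we take the
largest root of (8), say $\kappa_1$. Denote this vector by $\alpha^{(1)}$, and the
corresponding random variable by $U_{1\phi} = \alpha^{(1)\prime}X_\phi^{(1)}$. The
expected value of this first canonical variable is
$\mathscr{E}U_{1\phi} = \alpha^{(1)\prime}[\mathrm{B}(x_\phi^{(2)} - \bar{x}^{(2)}) + \tau]$.
Let $\alpha^{(1)\prime}\mathrm{B} = k\gamma^{(1)\prime}$, where $k$ is determined so

(9) $1 = \frac{1}{N}\sum_{\phi=1}^{N} \left[\gamma^{(1)\prime}x_\phi^{(2)} - \frac{1}{N}\sum_{\eta=1}^{N} \gamma^{(1)\prime}x_\eta^{(2)}\right]^2$
$\quad = \frac{1}{N}\sum_{\phi=1}^{N} \gamma^{(1)\prime}(x_\phi^{(2)} - \bar{x}^{(2)})(x_\phi^{(2)} - \bar{x}^{(2)})'\gamma^{(1)}$
$\quad = \gamma^{(1)\prime}S_{22}\gamma^{(1)}.$

Then $k = \sqrt{\kappa_1}$. Let $v_{1\phi} = \gamma^{(1)\prime}(x_\phi^{(2)} - \bar{x}^{(2)})$.
Then $\mathscr{E}U_{1\phi} = \sqrt{\kappa_1}\,v_{1\phi} + \alpha^{(1)\prime}\tau$.

Next let us obtain a linear combination $U_\phi = \alpha'X_\phi^{(1)}$ that has
maximum effect sum of squares among all linear combinations with variance 1 and
uncorrelated with $U_{1\phi}$, that is,
$0 = \mathscr{E}(U_\phi - \mathscr{E}U_\phi)(U_{1\phi} - \mathscr{E}U_{1\phi})' = \alpha'\Psi\alpha^{(1)}$.
As in Section 12.2, we can set up this maximization problem with Lagrange
multipliers and find that $\alpha$ satisfies (7) for some $\kappa$ satisfying (8)
and $\alpha'\Psi\alpha = 1$. The process is continued in a manner similar to that
in Section 12.2. We summarize the results.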

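The $j$th canonical random variable is $U_{j\phi} = \alpha^{(j)\prime}X_\phi^{(1)}$, where $\alpha^{(j)}$ satisfies
(7) for $\kappa = \kappa_j$ and $\alpha^{(j)\prime}\Psi\alpha^{(j)} = 1$; $\kappa_1 \ge \kappa_2 \ge \cdots \ge \kappa_{p_1}$ are the roots of (8).
We shall assume the rank of $B$ is $p_1 \le p_2$. (Then $\kappa_{p_1} > 0$.) $U_{j\phi}$ has the largest
effect sum of squares of linear combinations that have unit variance and are
uncorrelated with $U_{1\phi}, \ldots, U_{j-1,\phi}$. Let $\gamma^{(j)} = (1/\sqrt{\kappa_j})B'\alpha^{(j)}$, $\nu_j = \alpha^{(j)\prime}\tau$, and
$v_{j\phi} = \gamma^{(j)\prime}(x_\phi^{(2)} - \bar{x}^{(2)})$. Then

(10) $\mathcal{E}U_{j\phi} = \sqrt{\kappa_j}\,v_{j\phi} + \nu_j$,

(11) $\frac{1}{N}\sum_{\phi=1}^N \left[v_{j\phi} - \frac{1}{N}\sum_{\eta=1}^N v_{j\eta}\right]^2 = 1$,

(12) $\frac{1}{N}\sum_{\phi=1}^N \left[v_{i\phi} - \frac{1}{N}\sum_{\eta=1}^N v_{i\eta}\right]\left[v_{j\phi} - \frac{1}{N}\sum_{\eta=1}^N v_{j\eta}\right] = 0, \quad i \ne j$.

If $p_2 > p_1$, then $\gamma^{(p_1+1)}, \ldots, \gamma^{(p_2)}$ can be chosen so $v_{p_1+1,\phi} = \gamma^{(p_1+1)\prime}(x_\phi^{(2)} - \bar{x}^{(2)}), \ldots,$
$v_{p_2,\phi} = \gamma^{(p_2)\prime}(x_\phi^{(2)} - \bar{x}^{(2)})$ satisfy (11) and (12).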

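Let $A = (\alpha^{(1)}, \ldots, \alpha^{(p_1)})$, $\Gamma_1 = (\gamma^{(1)}, \ldots, \gamma^{(p_1)})$, $\Gamma_2 = (\gamma^{(p_1+1)}, \ldots, \gamma^{(p_2)})$,
$\Delta = \mathrm{diag}(\delta_1, \ldots, \delta_{p_1}) = \mathrm{diag}(\sqrt{\kappa_1}, \ldots, \sqrt{\kappa_{p_1}})$, $U_\phi = A'X_\phi^{(1)}$, $v_\phi^{(1)} = \Gamma_1'(x_\phi^{(2)} - \bar{x}^{(2)})$,
$v_\phi^{(2)} = \Gamma_2'(x_\phi^{(2)} - \bar{x}^{(2)})$, and $v_\phi = (v_\phi^{(1)\prime}, v_\phi^{(2)\prime})'$, $\phi = 1, \ldots, N$. Then

(13) $\mathcal{E}(U_\phi - \mathcal{E}U_\phi)(U_\phi - \mathcal{E}U_\phi)' = A'\Psi A = I$,

(14) $\mathcal{E}U_\phi = \Delta v_\phi^{(1)} + \nu$,

(15) $\frac{1}{N}\sum_{\phi=1}^N \left[v_\phi - \frac{1}{N}\sum_{\eta=1}^N v_\eta\right]\left[v_\phi - \frac{1}{N}\sum_{\eta=1}^N v_\eta\right]' = I$.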


The random canonical variates are uncorrelated and have variance 1. The
expected value of each random canonical variate is a multiple of the
corresponding nonstochastic canonical variate plus a constant. The nonstochastic
canonical variates have sample variance 1 and are uncorrelated in the
sample.


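If $p_1 > p_2$, the maximum rank of $B$ is $p_2$ and $\kappa_{p_2+1} = \cdots = \kappa_{p_1} = 0$. In that
case we define $A_1 = (\alpha^{(1)}, \ldots, \alpha^{(p_2)})$ and $A_2 = (\alpha^{(p_2+1)}, \ldots, \alpha^{(p_1)})$, where
$\alpha^{(1)}, \ldots, \alpha^{(p_2)}$ (corresponding to positive $\kappa$'s) are defined as before and
$\alpha^{(p_2+1)}, \ldots, \alpha^{(p_1)}$ are any vectors satisfying $\alpha^{(j)\prime}\Psi\alpha^{(j)} = 1$ and $\alpha^{(j)\prime}\Psi\alpha^{(i)} = 0$,
$i \ne j$. Then $\mathcal{E}U_{i\phi} = \delta_i v_{i\phi} + \nu_i$, $i = 1, \ldots, p_2$, and $\mathcal{E}U_{i\phi} = \nu_i$, $i = p_2 + 1, \ldots, p_1$.

In either case if the rank of $B$ is $r \le \min(p_1, p_2)$, there are $r$ roots of (8)
that are nonzero and hence $\mathcal{E}U_{i\phi} = \delta_i v_{i\phi} + \nu_i$ for $i = 1, \ldots, r$.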




12.6.2. Estimation 


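Let $x_1^{(1)}, \ldots, x_N^{(1)}$ be a set of observations on $X_1^{(1)}, \ldots, X_N^{(1)}$ with the probability
structure developed in Section 12.6.1, and let $x_1^{(2)}, \ldots, x_N^{(2)}$ be the set of
corresponding independent variates. Then we can estimate $\tau$, $B$, and $\Psi$ by

(16) $\hat\tau = \frac{1}{N}\sum_{\phi=1}^N x_\phi^{(1)} = \bar{x}^{(1)}$,

(17) $\hat{B} = A_{12}A_{22}^{-1} = S_{12}S_{22}^{-1}$,

(18) $\hat\Psi = \frac{1}{n}\sum_{\phi=1}^N \left[x_\phi^{(1)} - \bar{x}^{(1)} - \hat{B}(x_\phi^{(2)} - \bar{x}^{(2)})\right]\left[x_\phi^{(1)} - \bar{x}^{(1)} - \hat{B}(x_\phi^{(2)} - \bar{x}^{(2)})\right]' = \frac{1}{n}(A_{11} - A_{12}A_{22}^{-1}A_{21}) = S_{11} - S_{12}S_{22}^{-1}S_{21}$,

where the $A$'s and $S$'s are defined as before. (It is convenient to divide by
$n = N - 1$ instead of by $N$; the latter would yield maximum likelihood
estimators.)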

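The sample analogs of (7) and (8) are

(19) $0 = (\hat{B}S_{22}\hat{B}' - k\hat\Psi)\hat\alpha = [S_{12}S_{22}^{-1}S_{21} - k(S_{11} - S_{12}S_{22}^{-1}S_{21})]\hat\alpha$,

(20) $0 = |\hat{B}S_{22}\hat{B}' - k\hat\Psi| = |S_{12}S_{22}^{-1}S_{21} - k(S_{11} - S_{12}S_{22}^{-1}S_{21})|$.

The roots $k_1 \ge \cdots \ge k_{p_1}$ of (20) estimate the roots $\kappa_1 \ge \cdots \ge \kappa_{p_1}$ of (8), and
the corresponding solutions $\hat\alpha^{(1)}, \ldots, \hat\alpha^{(p_1)}$ of (19), normalized by $\hat\alpha^{(i)\prime}\hat\Psi\hat\alpha^{(i)} = 1$,
estimate $\alpha^{(1)}, \ldots, \alpha^{(p_1)}$. Then $\hat\gamma^{(j)} = (1/\sqrt{k_j})\hat{B}'\hat\alpha^{(j)}$ estimates $\gamma^{(j)}$, and $\hat\nu_j = \hat\alpha^{(j)\prime}\bar{x}^{(1)}$
estimates $\nu_j$. The sample canonical variates are $\hat\alpha^{(j)\prime}x_\phi^{(1)}$ and $\hat\gamma^{(j)\prime}(x_\phi^{(2)} - \bar{x}^{(2)})$,
$j = 1, \ldots, p_1$, $\phi = 1, \ldots, N$. If $p_1 > p_2$, then $p_1 - p_2$ more $\hat\alpha^{(j)}$'s can
be defined satisfying $\hat\alpha^{(j)\prime}\hat\Psi\hat\alpha^{(j)} = 1$ and $\hat\alpha^{(j)\prime}\hat\Psi\hat\alpha^{(i)} = 0$, $i \ne j$.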
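The estimation steps (16) to (20) can be illustrated numerically. The following is a minimal sketch, assuming NumPy and simulated data; the function name `canonical_regression` and the use of a Cholesky factor of $\hat\Psi$ to reduce (19) to a symmetric eigenproblem are choices made here for illustration, not part of the text.

```python
import numpy as np

def canonical_regression(x1, x2):
    """Sample canonical analysis of the regression of x1 (N x p1) on x2 (N x p2)."""
    N = x1.shape[0]
    n = N - 1                                   # divisor used in the text
    d1 = x1 - x1.mean(axis=0)                   # deviations from sample means
    d2 = x2 - x2.mean(axis=0)
    S11, S12, S22 = d1.T @ d1 / n, d1.T @ d2 / n, d2.T @ d2 / n
    B_hat = S12 @ np.linalg.inv(S22)            # equation (17)
    effect = B_hat @ S22 @ B_hat.T              # B S22 B'
    Psi_hat = S11 - effect                      # equation (18)
    # Reduce (effect - k Psi) a = 0 to a symmetric problem using Psi = L L'.
    L_inv = np.linalg.inv(np.linalg.cholesky(Psi_hat))
    M = L_inv @ effect @ L_inv.T
    k, V = np.linalg.eigh((M + M.T) / 2)
    order = np.argsort(k)[::-1]                 # largest root first
    k, V = k[order], V[:, order]
    alpha = L_inv.T @ V                         # columns satisfy alpha' Psi alpha = 1
    r = min(x1.shape[1], x2.shape[1])           # at most r roots are positive
    gamma = B_hat.T @ alpha[:, :r] / np.sqrt(k[:r])
    return k, alpha, gamma, B_hat, Psi_hat, S22

# Simulated example (hypothetical parameter values).
rng = np.random.default_rng(0)
x2 = rng.normal(size=(200, 3))
B_true = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, -1.0]])
x1 = 2.0 + x2 @ B_true.T + rng.normal(size=(200, 2))
k, alpha, gamma, B_hat, Psi_hat, S22 = canonical_regression(x1, x2)
print("roots k:", k)
```

The columns of `alpha` and `gamma` play the roles of $\hat\alpha^{(j)}$ and $\hat\gamma^{(j)}$; the normalizations $\hat\alpha^{(j)\prime}\hat\Psi\hat\alpha^{(j)} = 1$ and $\hat\gamma^{(j)\prime}S_{22}\hat\gamma^{(j)} = 1$ hold by construction.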


12.6.3. Relations Between Canonical Variates 
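In Section 12.3, the roots $l_1 \ge \cdots \ge l_{p_1}$ were defined to satisfy

(21) $0 = \begin{vmatrix} -lS_{11} & S_{12} \\ S_{21} & -lS_{22} \end{vmatrix} = (-1)^{p_2}\, l^{\,p_2 - p_1}\, |S_{22}| \cdot |S_{12}S_{22}^{-1}S_{21} - l^2S_{11}|$.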


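Since (20) can be written

(22) $0 = |(1 + k)S_{12}S_{22}^{-1}S_{21} - kS_{11}|$,

we see that $l_i^2 = k_i/(1 + k_i)$ and $k_i = l_i^2/(1 - l_i^2)$, $i = 1, \ldots, p_1$. The vector $a^{(i)}$
in Section 12.3 satisfies

(23) $0 = (S_{12}S_{22}^{-1}S_{21} - l_i^2S_{11})a^{(i)} = \left(S_{12}S_{22}^{-1}S_{21} - \frac{k_i}{1 + k_i}S_{11}\right)a^{(i)} = \frac{1}{1 + k_i}\left[S_{12}S_{22}^{-1}S_{21} - k_i(S_{11} - S_{12}S_{22}^{-1}S_{21})\right]a^{(i)}$,

which is equivalent to (19) for $k = k_i$. Comparison of the normalizations
$a^{(i)\prime}S_{11}a^{(i)} = 1$ and $\hat\alpha^{(i)\prime}(S_{11} - S_{12}S_{22}^{-1}S_{21})\hat\alpha^{(i)} = 1$ shows that $\hat\alpha^{(i)} =
(1/\sqrt{1 - l_i^2})\,a^{(i)}$. Then $\hat\gamma^{(j)} = (1/\sqrt{k_j})S_{22}^{-1}S_{21}\hat\alpha^{(j)} = c^{(j)}$.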

We see that canonical variable analysis can be applied when the two 
vectors are jointly random and when one vector is random and the other is 
nonstochastic. The canonical variables defined by the two approaches are the 
same except for normalization. The measure of relationship between
corresponding canonical variables can be the (canonical) correlation or it can be
the ratio of “explained” to “unexplained” variance. 
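The relation $l_i^2 = k_i/(1 + k_i)$ between the two sets of roots can be checked numerically. The sketch below assumes NumPy and simulated data; the helper `generalized_roots` is a hypothetical name introduced here.

```python
import numpy as np

def generalized_roots(H, W):
    """Roots of |H - c W| = 0 for symmetric H and positive definite W, largest first."""
    L_inv = np.linalg.inv(np.linalg.cholesky(W))
    M = L_inv @ H @ L_inv.T
    return np.linalg.eigvalsh((M + M.T) / 2)[::-1]

rng = np.random.default_rng(1)
N, p1, p2 = 150, 2, 4
x2 = rng.normal(size=(N, p2))
x1 = x2[:, :p1] + rng.normal(size=(N, p1))      # hypothetical dependence on x2
d1, d2 = x1 - x1.mean(axis=0), x2 - x2.mean(axis=0)
S11, S12, S22 = d1.T @ d1 / (N - 1), d1.T @ d2 / (N - 1), d2.T @ d2 / (N - 1)

H = S12 @ np.linalg.inv(S22) @ S12.T            # S12 S22^{-1} S21
l_sq = generalized_roots(H, S11)                # squared canonical correlations
k = generalized_roots(H, S11 - H)               # roots of (20)
print(np.allclose(l_sq, k / (1 + k)))
```

The first call measures the effect relative to total variance (the squared canonical correlation), the second relative to residual variance (the ratio of "explained" to "unexplained" variance).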


12.6.4. Testing Rank 


The number of roots $\kappa_i$ that are different from 0 is the rank of the regression
matrix $B$. It is the number of linear combinations of the regression variables
that are needed to express the expected values of $X_\phi^{(1)}$. We can ask whether
the rank is $k$ ($1 \le k < p_1$ if $p_1 \le p_2$) against the alternative that the rank is
greater than $k$. The hypothesis is

(24) $H_k\colon \kappa_{k+1} = \cdots = \kappa_{p_1} = 0$.

The likelihood ratio criterion [Anderson (1951b)] is a power of

(25) $\prod_{i=k+1}^{p_1} (1 + k_i)^{-1} = \prod_{i=k+1}^{p_1} (1 - l_i^2)$.

Note that this is the same criterion as for the case of both vectors stochastic
(Section 12.4). Then

(26) $-[N - \tfrac{1}{2}(p + 3)] \sum_{i=k+1}^{p_1} \log(1 - l_i^2)$

has approximately the $\chi^2$-distribution with $(p_1 - k)(p_2 - k)$ degrees of
freedom.

The determination of the rank as any number between 0 and $p_1$ can be
done as in Section 12.4.
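As a small worked illustration of (26), the following sketch computes the statistic and its degrees of freedom from given squared sample canonical correlations. It assumes $p = p_1 + p_2$ in (26); the function name `rank_test` and the numerical inputs are hypothetical.

```python
import math

def rank_test(l_sq, N, p1, p2, k):
    """Statistic (26) and chi-square degrees of freedom for the hypothesis rank = k."""
    p = p1 + p2                                   # assumed meaning of p in (26)
    tail = sorted(l_sq, reverse=True)[k:p1]       # the p1 - k smallest roots
    stat = -(N - 0.5 * (p + 3)) * sum(math.log(1.0 - v) for v in tail)
    df = (p1 - k) * (p2 - k)
    return stat, df

stat, df = rank_test([0.5, 0.1], N=100, p1=2, p2=3, k=1)
print(stat, df)
```

The statistic would then be referred to a $\chi^2$ table with `df` degrees of freedom.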


12.6.5. Linear Functional Relationships 


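The study of Section 12.6 can be carried out in other terms. For example, the
balanced one-way analysis of variance can be set up as

(27) $Y_{\alpha j} = \nu_\alpha + \mu + U_{\alpha j}, \quad \alpha = 1, \ldots, m, \quad j = 1, \ldots, l$,

where $\mathcal{E}U_{\alpha j} = 0$, $\mathcal{E}U_{\alpha j}U_{\alpha j}' = \Psi$, $\sum_{\alpha=1}^m \nu_\alpha = 0$, and

(28) $\Theta\nu_\alpha = 0, \quad \alpha = 1, \ldots, m$,

where $\Theta$ is $q \times p_1$ of rank $q$ ($\le p_1$). This is a special case of the model of
Section 12.6.1 with $\phi = 1, \ldots, N$, replaced by the pair of indices $(\alpha, j)$,
$X_\phi^{(1)} = Y_{\alpha j}$, $\tau = \mu$, and $B(x_\phi^{(2)} - \bar{x}^{(2)}) = \nu_\alpha$ by use of dummy variables as in
Section 8.8. The rank of $(\nu_1, \ldots, \nu_m)$ is that of $B$, namely, $r = p_1 - q$. There
are $q$ roots of (8) equal to 0 with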


(29) $BS_{22}B' = \frac{1}{m}\sum_{\alpha=1}^m \nu_\alpha\nu_\alpha'$.


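The model (27) can be interpreted as repeated observations on $\nu_\alpha + \mu$ with
error. The component equations of (28) are the linear functional
relationships.

Let $\bar{y}_\alpha = (1/l)\sum_{j=1}^l y_{\alpha j}$ and $\bar{y} = (1/m)\sum_{\alpha=1}^m \bar{y}_\alpha$. The sum of squares for effect
is

(30) $H = l\sum_{\alpha=1}^m (\bar{y}_\alpha - \bar{y})(\bar{y}_\alpha - \bar{y})' = n\hat{B}S_{22}\hat{B}'$

with $m - 1$ degrees of freedom, and the sum of squares for error is

(31) $G = \sum_{\alpha=1}^m \sum_{j=1}^l (y_{\alpha j} - \bar{y}_\alpha)(y_{\alpha j} - \bar{y}_\alpha)' = n\hat\Psi$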


with m(/ — 1) degrees of freedom. The case p, <p, corresponds to p, </ 
Then a maximum likelihood estimator of @ is 


(32) $\hat\Theta = (\hat\alpha^{(r+1)}, \ldots, \hat\alpha^{(p_1)})'$,

and the maximum likelihood estimators of $\nu_\alpha$ are
(33) $\hat\nu_\alpha = (I - \hat\Psi\hat\Theta'\hat\Theta)(\bar{y}_\alpha - \bar{y}), \quad \alpha = 1, \ldots, m$.


The estimator (32) can be multiplied by any nonsingular g X q matrix on the 
left to obtain another. For a fuller discussion, see Anderson (1984a) and 
Kendall and Stuart (1973). 


12.7. REDUCED RANK REGRESSION 


Reduced rank regression involves estimating the regression matrix $B$ in
$\mathcal{E}(X^{(1)}|X^{(2)}) = BX^{(2)}$ by a matrix $\hat{B}$ of preassigned rank $k$. In the
limited-information maximum likelihood method of estimating an equation that is part of
a system of simultaneous equations (Section 12.8), the regression matrix is
assumed to be of rank one less than the order of the matrix. Anderson
(1951a) derived the maximum likelihood estimator of $B$ when the model is

(1) $X_\alpha^{(1)} = \tau + B(x_\alpha^{(2)} - \bar{x}^{(2)}) + Z_\alpha, \quad \alpha = 1, \ldots, N$,

the rank of $B$ is specified to be $k$ ($\le p_1$), the vectors $x_1^{(2)}, \ldots, x_N^{(2)}$ are
nonstochastic, and $Z_\alpha$ is normally distributed. On the basis of a sample
$x_1, \ldots, x_N$, define $S$ by (2) of Section 12.3 and $\hat{A}$, $\hat\Lambda$, and $\hat\Gamma$ by (3), (4), and
(5). Partition $\hat\Lambda = \mathrm{diag}(\hat\Lambda_1, \hat\Lambda_2)$, $\hat{A} = (\hat{A}_1, \hat{A}_2)$, and $\hat\Gamma = (\hat\Gamma_1, \hat\Gamma_2)$, where $\hat{A}_1$,
$\hat\Lambda_1$, and $\hat\Gamma_1$ have $k$ columns. Let $\hat\Phi_1 = \hat{A}_1(I - \hat\Lambda_1^2)^{-1/2}$.


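Definition 12.7.1 (Reduced Rank Regression) The reduced rank regression
estimator in (1) is

(2) $\hat{B}_k = S_{YX}\hat\Gamma_1\hat\Gamma_1' = S_{YY}\hat{A}_1\hat\Lambda_1\hat\Gamma_1' = \hat\Sigma_{ZZ}\hat\Phi_1\hat\Phi_1'\hat{B}$,

where $\hat{B} = S_{YX}S_{XX}^{-1}$ and $\hat\Sigma_{ZZ} = S_{YY} - \hat{B}S_{XX}\hat{B}'$.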
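The last form of the estimator in Definition 12.7.1 can be sketched numerically as follows. This is an illustration under stated assumptions: NumPy, simulated data, moments computed with divisor $N$ (the estimator does not depend on the divisor), and the columns of `Phi1` obtained as the $k$ leading solutions of $(\hat{B}S_{XX}\hat{B}' - c\,\hat\Sigma_{ZZ})\phi = 0$ normalized by $\phi'\hat\Sigma_{ZZ}\phi = 1$. The function name is hypothetical.

```python
import numpy as np

def reduced_rank_regression(Y, X, k):
    """Rank-k estimate of B from rows of Y (N x p1) and X (N x p2)."""
    N = Y.shape[0]
    Yc, Xc = Y - Y.mean(axis=0), X - X.mean(axis=0)
    S_YY, S_YX, S_XX = Yc.T @ Yc / N, Yc.T @ Xc / N, Xc.T @ Xc / N
    B_hat = S_YX @ np.linalg.inv(S_XX)           # unrestricted least squares
    effect = B_hat @ S_XX @ B_hat.T
    Sigma_ZZ = S_YY - effect                     # residual covariance estimate
    # Leading k generalized eigenvectors of effect relative to Sigma_ZZ.
    L_inv = np.linalg.inv(np.linalg.cholesky(Sigma_ZZ))
    M = L_inv @ effect @ L_inv.T
    w, V = np.linalg.eigh((M + M.T) / 2)
    V1 = V[:, np.argsort(w)[::-1][:k]]
    Phi1 = L_inv.T @ V1                          # Phi1' Sigma_ZZ Phi1 = I_k
    return Sigma_ZZ @ Phi1 @ Phi1.T @ B_hat, B_hat

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
B_true = np.outer([1.0, -1.0, 0.5], [1.0, 2.0, 0.0, -1.0])   # rank 1
Y = X @ B_true.T + rng.normal(size=(300, 3))
B1, B_ls = reduced_rank_regression(Y, X, k=1)
print(np.linalg.matrix_rank(B1, tol=1e-8))
```

When $k = p_1 \le p_2$ the restriction is vacuous and the function returns the unrestricted estimator $\hat{B}$.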


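The maximum likelihood estimator of $B$ of rank $k$ is the same for $X^{(1)}$ and
$X^{(2)}$ normally distributed because the density of $X = (X^{(1)\prime}, X^{(2)\prime})'$ factors as

(3) $n(x|\mu, \Sigma) = n(x^{(1)}|\mu^{(1)} + B(x^{(2)} - \mu^{(2)}), \Sigma_{ZZ})\, n(x^{(2)}|\mu^{(2)}, \Sigma_{22})$.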


Reduced rank regression has been applied in many disciplines, including 
econometrics, time series analysis, and signal processing. See, for example, 
Johansen (1995) for use of reduced rank regression in estimation of cointe- 
gration in economic time series, Tsay and Tiao (1985) and Ahn and Reinsel 
(1988) for applications in stationary processes, and Stoica and Viberg (1996)
for utilization in signal processing. In general the estimated reduced rank
regression is a better estimator in a regression model than the unrestricted
estimator.

In Section 13.7 the asymptotic distribution of the reduced rank regression
estimator is obtained under the assumptions that are sufficient for the
asymptotic normality of the least squares estimator $\hat{B} = S_{YX}S_{XX}^{-1}$. The
asymptotic distribution of $\hat{B}_k$ has been obtained by Ryan, Hubert, Carter, Sprague,
and Parrott (1992), Schmidli (1996), Stoica and Viberg (1996), and Reinsel
and Velu (1998) by use of the expected Fisher information on the assumption
that $Z_\alpha$ is normally distributed. Izenman (1975) suggested the term reduced
rank regression.


12.8. SIMULTANEOUS EQUATIONS MODELS 


12.8.1. The Model 


Inference for structural equation models in econometrics is related to
canonical correlations. The general model is

(1) $By_t + \Gamma z_t = u_t, \quad t = 1, \ldots, T$,

where $B$ is $G \times G$ and $\Gamma$ is $G \times K$. Here $y_t$ is composed of $G$ jointly
dependent variables (endogenous), $z_t$ is composed of $K$ predetermined
variables (exogenous and lagged dependent) which are treated as
"independent" variables, and $u_t$ consists of $G$ unobservable random variables with

(2) $\mathcal{E}u_t = 0, \quad \mathcal{E}u_tu_t' = \Sigma$.


We require B to be nonsingular. This model was initiated by Haavelmo 
(1944) and was developed by Koopmans, Marschak, Hurwicz, Anderson, 
Rubin, Leipnik, et al., 1944-1954, at the Cowles Commission for Research in 
Economics. Each component equation represents the behavior of some group 
(such as consumers or producers) and has economic meaning. 

The set of structural equations (1) can be solved for $y_t$ (because $B$ is
nonsingular):

(3) $y_t = \Pi z_t + v_t$,

where

(4) $\Pi = -B^{-1}\Gamma, \quad v_t = B^{-1}u_t$,

with

(5) $\mathcal{E}v_t = 0, \quad \mathcal{E}v_tv_t' = B^{-1}\Sigma(B')^{-1} = \Omega$,

say. The equation (3) is called the reduced form of the model. It is a
multivariate regression model. In principle, it is observable.
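The passage from the structural form (1) to the reduced form (3) to (5) is a direct matrix computation. The following sketch assumes NumPy; the numerical values of `B`, `Gamma`, and `Sigma` are hypothetical and chosen only for illustration.

```python
import numpy as np

def reduced_form(B, Gamma, Sigma):
    """Return Pi = -B^{-1} Gamma and Omega = B^{-1} Sigma (B')^{-1}, as in (4) and (5)."""
    B_inv = np.linalg.inv(B)                 # B must be nonsingular
    Pi = -B_inv @ Gamma
    Omega = B_inv @ Sigma @ B_inv.T
    return Pi, Omega

# Hypothetical two-equation system with two predetermined variables.
B = np.array([[1.0, 0.8], [1.0, -0.5]])
Gamma = np.array([[-2.0, 0.0], [0.0, -1.5]])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
Pi, Omega = reduced_form(B, Gamma, Sigma)
print(Pi)
print(Omega)
```

Multiplying back, `B @ Pi + Gamma` is the zero matrix and `B @ Omega @ B.T` recovers `Sigma`.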


12.8.2. Identification by Specified Zeros 


The structural equation (1) can be multiplied on the left by an arbitrary 
nonsingular matrix. To determine component equations that are economi- 
cally meaningful, restrictions must be imposed. For example, in the case of 
demand and supply the equation describing demand may be distinguished by
the fact that it includes consumer income and excludes cost of raw materials, 
which is in the supply equation. The exclusion of the latter amounts to 
specifying that its coefficient in the demand equation is 0. 

We consider identification of a structural equation by specifying certain
coefficients to be 0. It is convenient to treat the first equation. Suppose the
variables are numbered so that the first $G_1$ jointly dependent variables are
included in the first equation and the remaining $G_2 = G - G_1$ are not and
the first $K_1$ predetermined variables are included and $K_2 = K - K_1$ are
excluded. Then we can partition the coefficient matrices as


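(6) $(B\ \ \Gamma) = \begin{pmatrix} \beta' & 0' & \gamma' & 0' \\ B_{21} & B_{22} & \Gamma_{21} & \Gamma_{22} \end{pmatrix}$,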


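where the vectors $\beta$, $0$, $\gamma$, and $0$ have $G_1$, $G_2$, $K_1$, and $K_2$ components,
respectively. The reduced form is partitioned conformally into $G_1$ and $G_2$
sets of rows and $K_1$ and $K_2$ sets of columns:

(7) $\Pi = \begin{pmatrix} \Pi_{11} & \Pi_{12} \\ \Pi_{21} & \Pi_{22} \end{pmatrix}$.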


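The relation between $B$, $\Gamma$, and $\Pi$ can be expressed as

(8) $\begin{pmatrix} \beta' & 0' \\ B_{21} & B_{22} \end{pmatrix}\begin{pmatrix} \Pi_{11} & \Pi_{12} \\ \Pi_{21} & \Pi_{22} \end{pmatrix} = -\begin{pmatrix} \gamma' & 0' \\ \Gamma_{21} & \Gamma_{22} \end{pmatrix}$.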


The upper right-hand corner of (8) yields

(9) $\beta'\Pi_{12} = 0$.

To determine $\beta$ ($G_1 \times 1$) uniquely except for a constant of proportionality we
need

(10) $\mathrm{rank}(\Pi_{12}) = G_1 - 1$.


This implies

(11) $K_2 \ge G_1 - 1$.

Addition of $G_2$ to (11) gives the order condition

(12) $G_2 + K_2 \ge G_2 + G_1 - 1 = G - 1$.

The number of specified 0's in an identified equation must be at least equal
to 1 less than the number of equations (or jointly dependent variables).

It can be shown that when $B$ is nonsingular (10) holds if and only if the
rank of the matrix consisting of the columns of $(B\ \ \Gamma)$ with specified 0's in the
first row is $G - 1$.
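The order condition (12) and the rank condition (10) can be checked mechanically for a given system. The sketch below assumes NumPy and a hypothetical demand and supply system in which the first equation contains both endogenous variables ($G_1 = 2$) and excludes the second predetermined variable ($K_1 = 1$); the function names are introduced here for illustration.

```python
import numpy as np

def order_condition(G, G1, K, K1):
    """Order condition (12): G2 + K2 >= G - 1."""
    G2, K2 = G - G1, K - K1
    return G2 + K2 >= G - 1

def rank_condition(Pi, G1, K1, tol=1e-10):
    """Rank condition (10): rank(Pi_12) = G1 - 1, Pi_12 the upper right block of Pi."""
    Pi12 = Pi[:G1, K1:]
    return np.linalg.matrix_rank(Pi12, tol=tol) == G1 - 1

# Hypothetical structural coefficients; Gamma[0, 1] = 0 is the specified zero.
B = np.array([[1.0, 0.8], [1.0, -0.5]])
Gamma = np.array([[-2.0, 0.0], [0.0, -1.5]])
Pi = -np.linalg.inv(B) @ Gamma

print(order_condition(G=2, G1=2, K=2, K1=1))
print(rank_condition(Pi, G1=2, K1=1))
print(B[0] @ Pi[:, 1:])        # beta' Pi_12, which should vanish by (9)
```

If no predetermined variable were excluded ($K_1 = K$), the order condition would fail and the first equation could not be distinguished from linear combinations of the two equations.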


12.8.3. Estimation of the Reduced Form 


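The model (3) is a typical multivariate regression model. The observations
are

(13) $\begin{pmatrix} y_1 \\ z_1 \end{pmatrix}, \ldots, \begin{pmatrix} y_T \\ z_T \end{pmatrix}$.

The usual estimators of $\Pi$ and $\Omega$ (Section 8.2) are

(14) $P = \sum_{t=1}^T y_tz_t' \left(\sum_{t=1}^T z_tz_t'\right)^{-1}$,

(15) $\hat\Omega = \frac{1}{T}\sum_{t=1}^T (y_t - Pz_t)(y_t - Pz_t)'$.

These are maximum likelihood estimators if the $v_t$ are normal.
If the $z_t$ are exogenous (regardless of normality), then

(16) $\mathcal{E}\,\mathrm{vec}\,P = \mathrm{vec}\,\Pi, \quad \mathcal{C}(\mathrm{vec}\,P) = A^{-1} \otimes \Omega$,

where

(17) $A = \sum_{t=1}^T z_tz_t'$

and $\mathrm{vec}(d_1, \ldots, d_m) = (d_1', \ldots, d_m')'$. If, furthermore, the $v_t$ are normal, then
$P$ is normal and $T\hat\Omega$ has the Wishart distribution with covariance matrix $\Omega$
and $T - K$ degrees of freedom.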
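The estimators (14) and (15) are ordinary multivariate least squares. A minimal sketch, assuming NumPy and simulated data with hypothetical parameter values (observations stored as rows $y_t'$ and $z_t'$):

```python
import numpy as np

def estimate_reduced_form(Y, Z):
    """Y is T x G (rows y_t'), Z is T x K (rows z_t'). Returns P, Omega_hat, A."""
    T = Y.shape[0]
    A = Z.T @ Z                                  # (17): sum of z_t z_t'
    P = np.linalg.solve(A, Z.T @ Y).T            # (14): (sum y_t z_t') A^{-1}
    resid = Y - Z @ P.T                          # rows are (y_t - P z_t)'
    Omega_hat = resid.T @ resid / T              # (15)
    return P, Omega_hat, A

rng = np.random.default_rng(4)
T = 500
Z = np.column_stack([np.ones(T), rng.normal(size=T)])
Pi = np.array([[1.0, 2.0], [-1.0, 0.5]])         # hypothetical reduced form
Y = Z @ Pi.T + rng.normal(size=(T, 2))
P, Omega_hat, A = estimate_reduced_form(Y, Z)
print(P)
```

The residuals are orthogonal to the predetermined variables, which is the defining property of the least squares fit.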


12.8.4. Estimation of the Coefficients of an Equation


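First consider the estimation of the vector of coefficients $\beta$ when $K_2 =
G_1 - 1$. Let

(18) $P = \begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix}$

be partitioned as $\Pi$. Then the probability is 1 that $\mathrm{rank}(P_{12}) = G_1 - 1$ and
the equation

(19) $\hat\beta'P_{12} = 0$

has a nontrivial solution that is unique except for a constant of
proportionality. This is the maximum likelihood estimator when the disturbance terms are
normal.

If $K_2 \ge G_1$, then the probability is 1 that $\mathrm{rank}(P_{12}) = G_1$ and (19) has only
the trivial solution $\hat\beta = 0$, which is unsatisfactory. To obtain a suitable
estimator we find $\hat\beta$ to minimize $\hat\beta'P_{12}$ in some sense relative to another
function of $\hat\beta'$.

Let $z_t$ be partitioned into subvectors of $K_1$ and $K_2$ components:

(20) $z_t = \begin{pmatrix} z_t^{(1)} \\ z_t^{(2)} \end{pmatrix}$,

(21) $A = \sum_{t=1}^T z_tz_t' = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$,

(22) $A_{22\cdot1} = A_{22} - A_{21}A_{11}^{-1}A_{12}$.

Let $y_t$ and $\Omega$ be partitioned into $G_1$ and $G_2$ components:

(23) $y_t = \begin{pmatrix} y_t^{(1)} \\ y_t^{(2)} \end{pmatrix}$,

(24) $\Omega = \begin{pmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{pmatrix}$.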

Now set up the multivariate analysis of variance table for $y_t^{(1)}$:

Source                   Sum of Squares
$z^{(1)}$                $\sum_{s,t=1}^T y_t^{(1)}z_t^{(1)\prime}A_{11}^{-1}z_s^{(1)}y_s^{(1)\prime}$
$z^{(2)} \cdot z^{(1)}$  $P_{12}A_{22\cdot1}P_{12}'$
Error                    $\sum_{t=1}^T (y_t^{(1)} - P_{11}z_t^{(1)} - P_{12}z_t^{(2)})(y_t^{(1)} - P_{11}z_t^{(1)} - P_{12}z_t^{(2)})'$
Total                    $\sum_{t=1}^T y_t^{(1)}y_t^{(1)\prime}$

The first term in the table is the (vector) sum of squares of $y_t^{(1)}$ due to the
effect of $z_t^{(1)}$. The second term is due to the effect of $z_t^{(2)}$ beyond the effect of
$z_t^{(1)}$. The two add to $(PAP')_{11}$, which is the total effect of $z_t$, the
predetermined variables.

We propose to find the vector $\beta$ such that the effect of $z_t^{(2)}$ on $\beta'y_t^{(1)}$ beyond
the effect of $z_t^{(1)}$ is minimized relative to the error sum of squares of $\beta'y_t^{(1)}$.
We minimize


(25) $\dfrac{\beta'(P_{12}S_{22\cdot1}P_{12}')\beta}{\beta'\hat\Omega_{11}\beta} = \dfrac{(\beta'P_{12})S_{22\cdot1}(\beta'P_{12})'}{\beta'\hat\Omega_{11}\beta}$,

where $T\hat\Omega = \sum_{t=1}^T y_ty_t' - PAP'$. This estimator has been called the least
variance ratio estimator. Under normality and based only on the 0 restrictions on
the coefficients of this single equation, the estimator is maximum likelihood
and is known as the limited-information maximum likelihood (LIML) estimator
[Anderson and Rubin (1949)].


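The algebra of minimizing (25) is to find the smallest root, say $\nu$, of

(26) $|P_{12}S_{22\cdot1}P_{12}' - \nu\hat\Omega_{11}| = 0$

and the corresponding vector satisfying

(27) $P_{12}S_{22\cdot1}P_{12}'\hat\beta = \nu\hat\Omega_{11}\hat\beta$.

The vector is normalized according to some rule. A frequently used rule is to
set one (nonzero) coefficient equal to 1, say the first, $\hat\beta_1 = 1$. If we write

(28) $\beta = \begin{pmatrix} 1 \\ \beta^* \end{pmatrix}, \quad \hat\beta = \begin{pmatrix} 1 \\ \hat\beta^* \end{pmatrix}$,

(29) $\Pi_{12} = \begin{pmatrix} \pi_{12}' \\ \Pi_{12}^* \end{pmatrix}, \quad P_{12} = \begin{pmatrix} p_{12}' \\ P_{12}^* \end{pmatrix}$,

(30) $\hat\Omega_{11} = \begin{pmatrix} \hat\omega_{11} & \hat\omega_{(1)}' \\ \hat\omega_{(1)} & \hat\Omega_{11}^* \end{pmatrix}$,

then (27) can be replaced by the linear equation

(31) $(P_{12}^*S_{22\cdot1}P_{12}^{*\prime} - \nu\hat\Omega_{11}^*)\hat\beta^* = -(P_{12}^*S_{22\cdot1}p_{12} - \nu\hat\omega_{(1)})$.

The first component equation in (27) has been dropped because it is linearly
dependent on the other equations [because $\nu$ is a root of (26)].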
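The computation described by (26) and (27) is a generalized symmetric eigenproblem. The following sketch assumes NumPy and takes the three matrices as given inputs; the function name `liml` is hypothetical, and the matrix called `S221` stands for the positive definite weight matrix in (26). In the example, `P12` is constructed so that $\beta'P_{12} = 0$ holds exactly for a hypothetical $\beta = (1, 2, -1)'$, in which case the smallest root is 0 and the normalized vector reproduces $\beta$.

```python
import numpy as np

def liml(P12, S221, Omega11):
    """Smallest root nu of |P12 S221 P12' - nu Omega11| = 0 and its vector,
    normalized so that the first coefficient equals 1."""
    H = P12 @ S221 @ P12.T
    L_inv = np.linalg.inv(np.linalg.cholesky(Omega11))
    M = L_inv @ H @ L_inv.T
    w, V = np.linalg.eigh((M + M.T) / 2)      # eigenvalues in ascending order
    beta = L_inv.T @ V[:, 0]                  # solves H beta = nu Omega11 beta
    return w[0], beta / beta[0]

rng = np.random.default_rng(5)
r1, r2 = rng.normal(size=3), rng.normal(size=3)
P12 = np.vstack([-2.0 * r1 + r2, r1, r2])     # rows chosen so (1, 2, -1) P12 = 0
C = rng.normal(size=(3, 3))
S221 = C @ C.T + np.eye(3)                    # hypothetical positive definite matrix
D = rng.normal(size=(3, 3))
Omega11 = D @ D.T + np.eye(3)                 # hypothetical error covariance block
nu, beta_hat = liml(P12, S221, Omega11)
print(nu, beta_hat)
```

With sampled data the smallest root is positive rather than zero, and the vector minimizes the variance ratio (25).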


12.8.5. Relation to the Linear Functional Relationship 


We now show that the model for the single linear functional relationship
($q = 1$) is identical to the model for structural equations in the special case
that $G_2 = 0$ ($y_t^{(1)} = y_t$) and $z_t^{(1)} \equiv 1$ ($K_1 = 1$). Write the two models as

(32) $X_{\alpha j} = \mu + \nu_\alpha + U_{\alpha j}, \quad \alpha = 1, \ldots, n, \quad j = 1, \ldots, k$,

where

(33) $\sum_{\alpha=1}^n \nu_\alpha = 0$,

and

(34) $y_t = \Pi_1 + \Pi_2z_t^{(2)} + v_t, \quad t = 1, \ldots, T$,

where $\Pi = (\Pi_1\ \Pi_2)$. The correspondence between the models is $p \leftrightarrow G = G_1$,


(35) Xo ane) U,; ou, 
(36) (a,jyet, nko, 
(37) wo, poll. 


We can write the model for the linear functional relationship with dummy 




variables. Define 


0 
(38) Sq; = | 1] < ath position, a=1,..., a-l, 
0 
~-1 
(39) = saa (eae 
=i 
Then 
1 
(40) pore (Heed) |s i a=1,...,7, 
a) 


where j may be suppressed. Note 
(41) WH AA +¥,-1)- 


The correspondence is 


(42) Loz), Cg ae 
(43) we IL, (¥,,-.-5¥%-1) eh, 
(44) 1+K,, n-1le¢K;, 
(45) B(v),...,¥,-;) =0° BTL, =0. 


Let P=(P, P,). In terms of the statistics we have the correspondence 


(46) pare, 


(47) o,=% 


The effect matrix is

(48) $H = k\sum_{\alpha=1}^n (\bar{x}_\alpha - \bar{x})(\bar{x}_\alpha - \bar{x})' \leftrightarrow P_2A_{22\cdot1}P_2'$,

and the error matrix is

(49) $G = \sum_{\alpha=1}^n \sum_{j=1}^k (x_{\alpha j} - \bar{x}_\alpha)(x_{\alpha j} - \bar{x}_\alpha)' \leftrightarrow T\hat\Omega = \sum_{t=1}^T (y_t - Pz_t)(y_t - Pz_t)'$.

Then the estimator $\hat\beta$ of the linear functional relationship for $q = 1$ is
identical to the LIML estimator [Anderson (1951b), (1976), (1984a)].


12.8.6. Asymptotic Theory as $T \to \infty$


We shall find the limiting distribution of $\sqrt{T}(\hat\beta^* - \beta^*)$ defined by (28) and
(31) by showing that $\hat\beta^*$ is asymptotically equivalent to

(50) $\hat\beta^*_{TSLS} = -(P_{12}^*S_{22\cdot1}P_{12}^{*\prime})^{-1}P_{12}^*S_{22\cdot1}p_{12}$.

This derivation is essentially the same as that given in Anderson and Rubin
(1950) except for notation. The estimator defined by (50), known as the two
stage least squares (TSLS) estimator, is an approximation to the LIML
estimator obtained by dropping the terms $\nu\hat\Omega_{11}^*$ and $\nu\hat\omega_{(1)}$ from (31). Let
$\hat\beta^* = \hat\beta^*_{LIML}$. We assume the conditions for $\sqrt{T}(P - \Pi)$ having a limiting
normal distribution. (See Theorem 8.11.1.)
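The TSLS estimator is (31) with the terms in $\nu$ dropped, so it needs only one linear solve. The sketch below assumes NumPy, treats the first row of `P12` as $p_{12}'$ and the remaining rows as $P_{12}^*$ (an assumed reading of the partition), and uses the hypothetical name `tsls`. The example matrix is built so that $\beta'P_{12} = 0$ exactly with $\beta = (1, 2, -1)'$, in which case the estimator returns $\beta^* = (2, -1)'$.

```python
import numpy as np

def tsls(P12, S221):
    """beta*_TSLS = -(P* S P*')^{-1} P* S p, with p' the first row of P12
    and P* the remaining rows."""
    p12 = P12[0]
    P12_star = P12[1:]
    M = P12_star @ S221 @ P12_star.T
    return -np.linalg.solve(M, P12_star @ S221 @ p12)

rng = np.random.default_rng(6)
r1, r2 = rng.normal(size=3), rng.normal(size=3)
P12 = np.vstack([-2.0 * r1 + r2, r1, r2])     # (1, 2, -1) P12 = 0 by construction
C = rng.normal(size=(3, 3))
S221 = C @ C.T + np.eye(3)                    # hypothetical positive definite weight
beta_star = tsls(P12, S221)
print(beta_star)
```

Unlike LIML, no root of a determinantal equation is required, which is the computational appeal of the approximation.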


Lemma 12.8.1. Suppose $(1/T)A \to A^0$, a positive definite matrix, as $T \to
\infty$. Then $\nu = O_p(1/T)$, where $\nu$ is the smallest root of (26).


Proof. Let P,, = VT (P, — 1,). Then because B’T1,, = 0 


(51) BPS 1Pi2B B[,, + (1/vT) P, | Soo [Da + (1/VT ) Py] B 
pO,,B p'O,,B 
TB’D,,.B 0, 7}: 
Since 
(52) y= min BPpSpiPiB Pi2SyiPrB — BP S21 Pi2B 


6 8 pO,B = BOB 


the lemma follows. | 


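The import of Lemma 12.8.1 is that the difference between the LIML
estimator and the TSLS estimator is $O_p(1/T)$. We have

(53) $\hat\beta^*_{LIML} - \hat\beta^*_{TSLS}$
$= (P_{12}^*S_{22\cdot1}P_{12}^{*\prime})^{-1}P_{12}^*S_{22\cdot1}p_{12} - (P_{12}^*S_{22\cdot1}P_{12}^{*\prime} - \nu\hat\Omega_{11}^*)^{-1}(P_{12}^*S_{22\cdot1}p_{12} - \nu\hat\omega_{(1)})$
$= [(P_{12}^*S_{22\cdot1}P_{12}^{*\prime})^{-1} - (P_{12}^*S_{22\cdot1}P_{12}^{*\prime} - \nu\hat\Omega_{11}^*)^{-1}]P_{12}^*S_{22\cdot1}p_{12} + (P_{12}^*S_{22\cdot1}P_{12}^{*\prime} - \nu\hat\Omega_{11}^*)^{-1}\nu\hat\omega_{(1)}$
$= -\nu(P_{12}^*S_{22\cdot1}P_{12}^{*\prime})^{-1}\hat\Omega_{11}^*(P_{12}^*S_{22\cdot1}P_{12}^{*\prime} - \nu\hat\Omega_{11}^*)^{-1}P_{12}^*S_{22\cdot1}p_{12} + (P_{12}^*S_{22\cdot1}P_{12}^{*\prime} - \nu\hat\Omega_{11}^*)^{-1}\nu\hat\omega_{(1)}$
$= O_p(\nu) = O_p(1/T)$.

Consider

(54) $p_{12} + P_{12}^{*\prime}\beta^* = P_{12}'\beta = A_{22\cdot1}^{-1}\sum_{t=1}^T z_t^{(2\cdot1)}y_t^{(1)\prime}\beta = A_{22\cdot1}^{-1}\sum_{t=1}^T z_t^{(2\cdot1)}u_{1t}$,

where $z_t^{(2\cdot1)} = z_t^{(2)} - A_{21}A_{11}^{-1}z_t^{(1)}$. Thus $\mathcal{E}(p_{12} + P_{12}^{*\prime}\beta^*) = \mathcal{E}P_{12}'\beta = 0$ and

(55) $\mathcal{E}(p_{12} + P_{12}^{*\prime}\beta^*)(p_{12} + P_{12}^{*\prime}\beta^*)' = \mathcal{E}P_{12}'\beta(P_{12}'\beta)' = \sigma_{11}A_{22\cdot1}^{-1}$.

Note that $\hat\beta^*_{TSLS} - \beta^* = -(P_{12}^*S_{22\cdot1}P_{12}^{*\prime})^{-1}P_{12}^*S_{22\cdot1}P_{12}'\beta$ and $(\beta', 0)y_t +
(\gamma', 0)z_t = u_{1t}$.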


Theorem 12.8.1. Under the conditions of Theorem 8.11.1 
A d ae | 
(56) VT (Bi: — BY) > N{o, O1, (TT, Sy.) 12) I 


Proof. The theorem follows from (55), S..,— $3,.,, and P,—> IT,,. 
| 


Because of the correspondence between the LIML estimator and the
maximum likelihood estimator for the linear functional relationship as
outlined in Section 12.8.5, this asymptotic theory can be translated for the latter.




Suppose the single linear functional relationship is written as 


(57) O=B'v,=(1 B*’) - Vig + BPM, a= 1.5m, 
where 

Vv 
(58) v= | a@=1,...,n. 


Let n(@ K) be fixed, and let the number of replications k — % (correspond- 
ing to T/K — & for fixed K). Let c? = B'WB. 
Since [1,,45).,IT,, corresponds to KL", v* v*’, B* here has the approxi- 


mate distribution 
-1 
*o7(k 5 >» vty = : 
a=] 


Although Anderson and Rubin (1950) showed that $\nu\hat\Omega_{11}^*$ and $\nu\hat\omega_{(1)}$ could
be dropped from (31) defining $\hat\beta^*_{LIML}$ and hence that $\hat\beta^*_{TSLS}$ was
asymptotically equivalent to $\hat\beta^*_{LIML}$, they did not explicitly propose $\hat\beta^*_{TSLS}$. [As part of
the Cowles Commission program, Chernoff and Divinsky (1953) developed a
computational program of $\hat\beta_{LIML}$.] The TSLS estimator was proposed by
Basmann (1957) and Theil (1961). It corresponds in the linear functional
relationship setup to ordinary least squares on the first coordinate. If some
other coefficient of $\beta$ were set equal to one, the minimization would be in
the direction of that coordinate.

Consider the general linear functional relationship when the error covari- 
ance matrix is unknown and there are replications. Constrain B to be 


(59) NIB 


(60) B=(I, B*), 


y 
y | 


Then the least squares estimator of B* is 


Partition 


(61) v= 


a 


-1 
(62) B*, = 7 (#0) -20)(22 2) | . (x — $2) (xO — xO) 


a= | t=] 




aA ‘] 
For n fixed and k co and B¥, > B* and 


" ~i 
(63) Vk vec BY. ~ B*) > nfo, | VY yO ypO 


a=t 


onus 


[See Anderson (1984b).] It was shown by Anderson (1951c) that the $q$
smallest sample roots are of such a probability order that the maximum
likelihood estimator is asymptotically equivalent, that is, the limiting
distribution of $\sqrt{k}\,\mathrm{vec}(\hat{B}^*_{ML} - B^*)$ is the right-hand side of (63).


12.8.7. Other Asymptotic Theory 


In terms of the linear functional relationship it may be more natural to
consider $n \to \infty$ and $k$ fixed. When $k = 1$ and the error covariance matrix is
$\sigma^2I_p$, Gleser (1981) has given the asymptotic theory. For the simultaneous
equations model, the corresponding conditions are that $K_2 \to \infty$, $T \to \infty$, and
$K_2/T$ approaches a positive limit. Kunitomo (1980) has given an asymptotic
expansion of the distribution in the case of $p = 2$ and $m = q = 1$.

When $n \to \infty$, the least squares estimator (i.e., minimizing the sum of
squares of the residuals in one fixed direction) is not consistent; the LIML
and TSLS estimators are not asymptotically equivalent.


12.8.8. Distributions of Estimators 


Econometricians have studied intensively the distributions of TSLS and
LIML estimators, particularly in the case of two endogenous variables.

Exact distributions have been given by Basmann (1961), (1963), Richardson
(1968), Sawa (1969), Mariano and Sawa (1972), Phillips (1980), and Anderson
and Sawa (1982). These have not been very informative because they are
usually given in terms of infinite series the properties of which are unknown
or irrelevant.

A more useful approach is by approximating the distributions. Asymptotic
expansions of distributions have been made by Sargan and Mikhail (1971),
Anderson and Sawa (1973), Anderson (1974), Kunitomo (1980), and others.
Phillips (1982) studied the Padé approach. See also Anderson (1977).

Tables of the distributions of the TSLS and LIML estimators in the case 
of two endogenous variables have been given by Anderson and Sawa 
(1977), (1979), and Anderson, Kunitomo, and Sawa (1983a). 

Anderson, Kunitomo, and Sawa (1983b) graphed densities of the maxi- 
mum likelihood estimator and the least squares estimator (minimizing in one 
direction) for the linear functional relationship (Section 12.6) for the case 




p=2,.m=qz=1,WV=o°W, and for various values of B,n, and 


(6+) 5° = ? (Ha -B)- 


PROBLEMS 


12.1. (Sec. 12.2) Let z,=2z,,=1,a=1,...,n,and B= B. Verify that «0 = x7'g. 
Relate this result to the discriminant function (Chapter 6). 


12.2. (Sec. 12.2) Prove that the roots of (14) are real. 
12.3. (Sec. 12.2)


(a) Lee Xo = (XO XO) eX =0, 


En Lp 
& XX" = 5 
& Ln» 


U=a'X) Vay XO, €U2=1= 86V", where a and y¥ are vectors. 
Show that choosing @ and y to maximize &UV is equivalent to choosing 
a and y to minimize the generalized variance of (U V), 

(b) Let X'=(X¥M! YQ" xO) ex =0, 


Zay Xi. Lis 
EXX'=Y=|X., UZ. Las], 
2st 240 X43 


Ua=a' XO, Vey’ XO), W= BX, U2 = V2 = EW*=1., Consider 
finding a,y,B to minimize the generalized variance of (U,V,W). Show 


that this minimum is invariant with respect to transformations X*) = 
A,X, \Aj| #0. 


(c) By using such transformations, transform 2 into the simplest possible 
form. 


(d) In the case of $X^{(i)}$ consisting of two components, reduce the problem (of
minimizing the generalized variance) to its simplest form.
(e) In this case give the derivative equations.


(f) Show that the minimum generalized variance is 1 if and only if 24. =0, 
13 = 0, 223 = 0. (Note: This extension of the notion of canonical variates 
does not lend itself to a “nice” explicit treatment.) 


12.4. (Sec. 12.2) Let

$X^{(1)} = AZ + Y^{(1)}$,
$X^{(2)} = BZ + Y^{(2)}$,

where $Y^{(1)}$, $Y^{(2)}$, $Z$ are independent with mean zero and covariance matrices $I$
with appropriate dimensionalities. Let $A = (a_1, \ldots, a_k)$, $B = (b_1, \ldots, b_k)$, and
suppose that $A'A$, $B'B$ are diagonal with positive diagonal elements. Show
that the canonical variables for nonzero canonical correlations are
proportional to $a_i'X^{(1)}$, $b_i'X^{(2)}$. Obtain the canonical correlation coefficients and
appropriate normalizing coefficients for the canonical variables.


12.5. (Sec. 12.2) Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_q > 0$ be the positive roots of (14), where $\Sigma_{11}$
and $\Sigma_{22}$ are $q \times q$ nonsingular matrices.

(a) What is the rank of $\Sigma_{12}$?

(b) Write $\prod_{i=1}^q \lambda_i^2$ as the determinant of a rational function of $\Sigma_{11}$, $\Sigma_{12}$, $\Sigma_{21}$,
and $\Sigma_{22}$. Justify your answer.

(c) If $\lambda_q = 1$, what is the rank of
$\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$?


12.6. (Sec. 12.2) Let $\Sigma_{11} = (1 - g)I_{p_1} + g\varepsilon_{p_1}\varepsilon_{p_1}'$, $\Sigma_{22} = (1 - h)I_{p_2} + h\varepsilon_{p_2}\varepsilon_{p_2}'$, $\Sigma_{12} =
k\varepsilon_{p_1}\varepsilon_{p_2}'$, where $-1/(p_1 - 1) < g < 1$, $-1/(p_2 - 1) < h < 1$, and $k$ is suitably
restricted. Find the canonical correlations and variates. What is the
appropriate restriction on $k$?


12.7. (Sec. 12.3) Find the canonical correlations and canonical variates between
the first two variables and the last three in Problem 4.42.


12.8. (Sec. 12.3) Prove directly the sample analog of Theorem 12.2.1.


12.9. (Sec. 12.3) Prove that $\lambda^2(i + 1) \to \lambda_1^2$ and $\alpha(i + 1) \to \alpha^{(1)}$ if $\alpha(0)$ is such that
$\alpha'(0)\Sigma_{11}\alpha^{(1)} \ne 0$. [Hint: Use $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = A\Lambda^2A^{-1}$.]


12.10. (Sec. 12.6) Prove (9), (10), and (11).


12.11. Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_q$ be the roots of $|\Sigma_1 - \lambda\Sigma_2| = 0$, where $\Sigma_1$ and $\Sigma_2$ are
$q \times q$ positive definite covariance matrices.

(a) What does $\lambda_1 = \lambda_q = 1$ imply about the relationship of $\Sigma_1$ and $\Sigma_2$?

(b) What does $\lambda_q > 1$ imply about the relationships of the ellipsoids $x'\Sigma_1^{-1}x
= c$ and $x'\Sigma_2^{-1}x = c$?

(c) What does $\lambda_1 > 1$ and $\lambda_q < 1$ imply about the relationships of the
ellipsoids $x'\Sigma_1^{-1}x = c$ and $x'\Sigma_2^{-1}x = c$?


12.12. (Sec. 12.4) For $q = 2$ express the criterion (2) of Section 9.5 in terms of
canonical correlations.


12.13. Find the canonical correlations for the data in Problem 9.11.


CHAPTER 13 


The Distributions of Characteristic 
Roots and Vectors 


13.1. INTRODUCTION 


In this chapter we find the distribution of the sample principal component 
vectors and their sample variances when all population variances are 1 
(Section 13.3). We also find the distribution of the sample canonical correla- 
tions and one set of canonical vectors when the two sets of original variates 
are independent. This second distribution will be shown to be equivalent to 
the distribution of roots and vectors obtained in the next section. The 
distribution of the roots is particularly of interest because many invariant 
tests are functions of these roots. For example, invariant tests of the general 
linear hypothesis (Section 8.6) depend on the sample only through the roots 
of the determinantal equation 


(1) $|(\hat\beta_{1\Omega} - \beta_1^*)A_{11\cdot2}(\hat\beta_{1\Omega} - \beta_1^*)' - lN\hat\Sigma_\Omega| = 0$.


If the hypothesis is true, the roots have the distribution given in Theorem 
13.2.2 or 13.2.3. Thus the significance level of any invariant test of the 
general linear hypothesis can be obtained from the distribution derived in the 
next section. If the test criterion is one of the ordered roots (e.g., the largest 
root), then the desired distribution is a marginal distribution of the joint 
distribution of roots. 

The limiting distributions of the roots are obtained under fairly general 
conditions. These are needed to obtain other limiting distributions, such as
the distribution of the criterion for testing that the smallest variances of 


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson 
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc. 




principal components are equal. Some limiting distributions are obtained for 
elliptically contoured distributions. 


13.2. THE CASE OF TWO WISHART MATRICES 


13.2.1. The Transformation 


Let us consider $A^*$ and $B^*$ ($p \times p$) distributed independently according to
$W(\Sigma, m)$ and $W(\Sigma, n)$ respectively ($m, n \ge p$). We shall call the roots of

(1) $|A^* - lB^*| = 0$

the characteristic roots of $A^*$ in the metric of $B^*$ and the vectors satisfying

(2) $(A^* - lB^*)x^* = 0$

the characteristic vectors of $A^*$ in the metric of $B^*$. In this section we shall
consider the distribution of these roots and vectors. Later it will be shown
that the squares of canonical correlation coefficients have this distribution if
the population canonical correlations are all zero.

First we shall transform $A^*$ and $B^*$ so that the distributions do not
involve an arbitrary matrix $\Sigma$. Let $C$ be a matrix such that $C\Sigma C' = I$. Let

(3) $A = CA^*C', \quad B = CB^*C'$.

Then $A$ and $B$ are independently distributed according to $W(I, m)$ and
$W(I, n)$ respectively (Section 7.3.3). Since

$|A - lB| = |CA^*C' - lCB^*C'| = |C(A^* - lB^*)C'| = |C| \cdot |A^* - lB^*| \cdot |C'|$,

the roots of (1) are the roots of

(4) $|A - lB| = 0$.

The corresponding vectors satisfying

(5) $(A - lB)x = 0$

satisfy

(6) $0 = C^{-1}(A - lB)x = C^{-1}(CA^*C' - lCB^*C')x = (A^* - lB^*)C'x$.

Thus the vectors $x^*$ are the vectors $C'x$.


It will be convenient to consider the roots of

(7) $|A - f(A + B)| = 0$

and the vectors $y$ satisfying

(8) $[A - f(A + B)]y = 0$.

The latter equation can be written

(9) $0 = (A - fA - fB)y = [(1 - f)A - fB]y$.

Since the probability that $f = 1$ (i.e., that $|-B| = 0$) is 0, the above equation
is

(10) $\left(A - \dfrac{f}{1 - f}B\right)y = 0$.

Thus the roots of (4) are related to the roots of (7) by $l = f/(1 - f)$ or
$f = l/(1 + l)$, and the vectors satisfying (5) are equal (or proportional) to
those satisfying (8).

We now consider finding the distribution of the roots and vectors
satisfying (7) and (8). Let the roots be ordered $f_1 > f_2 > \cdots > f_p > 0$ since the
probability of two roots being equal is 0 [Okamoto (1973)]. Let

(11) $F = \begin{pmatrix} f_1 & 0 & \cdots & 0 \\ 0 & f_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & f_p \end{pmatrix}$.

Suppose the corresponding vector solutions of (8) normalized by

(12) $y'(A + B)y = 1$

are $y_1, \ldots, y_p$. These vectors must satisfy

(13) $y_i'(A + B)y_j = 0, \quad i \ne j$,

because $y_i'Ay_j = f_jy_i'(A + B)y_j$ and $y_j'Ay_i = f_iy_j'(A + B)y_i$, and this can be only
if (13) holds ($f_i \ne f_j$).
Let the $p \times p$ matrix $Y$ be

(14) $Y = (y_1, \ldots, y_p)$.

Equation (8) can be summarized as

(15) $AY = (A + B)YF$,

and (12) and (13) give

(16) $Y'(A + B)Y = I$.

From (15) we have

(17) $Y'AY = Y'(A + B)YF = F$.

Multiplication of (16) and (17) on the left by $(Y')^{-1}$ and on the right by $Y^{-1}$
gives

(18) $A + B = (Y')^{-1}Y^{-1}, \quad A = (Y')^{-1}FY^{-1}$.

Now let $Y^{-1} = E$. Then

(19) $A + B = E'E, \quad A = E'FE, \quad B = E'(I - F)E$.

We now consider the joint distribution of $E$ and $F$. From (19) we see that
$E$ and $F$ define $A$ and $B$ uniquely. From (7) and (11) and the ordering
$f_1 > \cdots > f_p$ we see that $A$ and $B$ define $F$ uniquely. Equations (8) for $f = f_i$
and (12) define $y_i$ uniquely except for multiplication by $-1$ (i.e., replacing $y_i$
by $-y_i$). Since $YE = I$, this means that $E$ is defined uniquely except that rows
of $E$ can be multiplied by $-1$. To remove this indeterminacy we require that
$e_{i1} \ge 0$. (The probability that $e_{i1} = 0$ is 0.) Thus $E$ and $F$ are uniquely defined
in terms of $A$ and $B$.
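The construction of $E$ and $F$ from $A$ and $B$ can be traced numerically. The following sketch assumes NumPy, simulates two independent Wishart matrices with identity covariance, and uses a Cholesky factor of $A + B$ to solve (8); the dimensions and seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
p, m, n = 3, 6, 8
X = rng.normal(size=(p, m))
Z = rng.normal(size=(p, n))
A, B = X @ X.T, Z @ Z.T                 # independent W(I, m) and W(I, n)
G = A + B

# Solve [A - f(A + B)] y = 0 with y'(A + B)y = 1 via G = L L'.
L_inv = np.linalg.inv(np.linalg.cholesky(G))
M = L_inv @ A @ L_inv.T
f, V = np.linalg.eigh((M + M.T) / 2)
order = np.argsort(f)[::-1]             # f_1 > f_2 > ... > f_p
f, V = f[order], V[:, order]
Y = L_inv.T @ V                         # columns y_i, so that Y'(A + B)Y = I
E = np.linalg.inv(Y)
E = E * np.where(E[:, 0] < 0, -1.0, 1.0)[:, None]   # enforce e_{i1} >= 0
F = np.diag(f)

# Roots of |A - l B| = 0 computed directly, for comparison with l = f/(1 - f).
Lb_inv = np.linalg.inv(np.linalg.cholesky(B))
Mb = Lb_inv @ A @ Lb_inv.T
l_direct = np.linalg.eigvalsh((Mb + Mb.T) / 2)[::-1]
l_from_f = f / (1 - f)
print(np.allclose(l_from_f, l_direct))
```

Changing the sign of a row of $E$ leaves $E'FE$ and $E'E$ unchanged, which is why the convention $e_{i1} \ge 0$ is needed to make the transformation one-to-one.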


13.2.2. The Jacobian 


To find the density of E and F we substitute in the density of A and B 
according to (19) and multiply by the Jacobian of the transformation. We 
devote this subsection to finding the Jacobian 


(20) Fert 


a(E, F) 


Since the transformation from A and B to A and G=A+B8 has Jacobian 
unity, we shall find 


§32 THE DISTRIBUTIONS OF CHARACTERISTIC ROOTS AND VECTORS 


First we notice that if x, =f,(y,,..-,¥,), @=1,...,m, is a one-to-one 
transformation, the Jacobian is the determinant of the linear transformation 


(22) dx, = “23 Sh bp, 


where dx, and dy, are only formally differentials (i.e., we write these as a 
mnemonic device). If f,(y,,-.-,¥,) iS a polynomial, then of,/dy, is the 
coefficient of yi in the expansion of f,(y,+yf.--->¥.+y¥*) [in fact the 
coefficient in the expansion of f,(y,,-.-3 ¥g-1 Yat Yes Yasie++ +2 ak 
The elements of A and G are polynomials in E and F. Thus the derivative of
an element of A is the coefficient of an element of $E^*$ and $F^*$ in the
expansion of $(E + E^*)'(F + F^*)(E + E^*)$, and the derivative of an element of
G is the coefficient of an element of $E^*$ and $F^*$ in the expansion of
$(E + E^*)'(E + E^*)$. Thus the Jacobian of the transformation from A, G
to E, F is the determinant of the linear transformation

(23) $dA = (dE)'FE + E'(dF)E + E'F(dE),$
(24) $dG = (dE)'E + E'(dE).$
Since A and G (dA and dG) are symmetric, only the functionally independent
component equations above are used.
Multiply (23) and (24) on the left by $E'^{-1}$ and on the right by $E^{-1}$ to
obtain
(25) $E'^{-1}(dA)E^{-1} = E'^{-1}(dE)'F + dF + F(dE)E^{-1},$
(26) $E'^{-1}(dG)E^{-1} = E'^{-1}(dE)' + (dE)E^{-1}.$
It should be kept in mind that (23) and (24) are now considered as a linear
transformation without regard to how the equations were obtained.
Let

(27) $E'^{-1}(dA)E^{-1} = d\bar A,$
(28) $E'^{-1}(dG)E^{-1} = d\bar G,$
(29) $(dE)E^{-1} = dW.$
Then

(30) $d\bar A = (dW)'F + dF + F(dW),$
(31) $d\bar G = (dW)' + dW.$

The linear transformation from dE, dF to dA, dG is considered as the linear
transformation from dE, dF to dW, dF with determinant $|E^{-1}|^p = |E|^{-p}$
(because each row of dE is transformed by $E^{-1}$), followed by the linear
transformation from dW, dF to $d\bar A$, $d\bar G$, followed by the linear transformation
from $d\bar A$, $d\bar G$ to $dA = E'(d\bar A)E$, $dG = E'(d\bar G)E$ with determinant
$|E|^{p+1}\cdot|E|^{p+1}$ (from Section 7.3.3); and the determinant of the linear transformation
from dE, dF to dA, dG is the product of the determinants of the three
component transformations. The transformation (30), (31) is written in
components as

(32) $d\bar a_{ii} = df_i + 2f_i\,dw_{ii},$
     $d\bar a_{ij} = f_j\,dw_{ji} + f_i\,dw_{ij}, \qquad i<j,$
     $d\bar g_{ii} = 2\,dw_{ii},$
     $d\bar g_{ij} = dw_{ji} + dw_{ij}, \qquad i<j.$


The determinant is

(33) $\begin{vmatrix} I & 2F & 0 & 0 \\ 0 & 2I & 0 & 0 \\ 0 & 0 & M & N \\ 0 & 0 & I & I \end{vmatrix} = \begin{vmatrix} I & 2F \\ 0 & 2I \end{vmatrix}\cdot\begin{vmatrix} M & N \\ I & I \end{vmatrix} = 2^p\,|M - N|,$

where the rows correspond to $d\bar a_{ii}$, $d\bar g_{ii}$, $d\bar a_{ij}$ ($i<j$), and $d\bar g_{ij}$ ($i<j$), and the
columns to $df_i$, $dw_{ii}$, $dw_{ij}$ ($i<j$), and $dw_{ij}$ ($i>j$). With the rows
$d\bar a_{12},\dots,d\bar a_{1p},d\bar a_{23},\dots,d\bar a_{p-1,p}$ and the columns $dw_{12},\dots,dw_{1p},dw_{23},\dots,dw_{p-1,p}$,

(34) $M = \operatorname{diag}(f_1 I_{p-1},\ f_2 I_{p-2},\ \dots,\ f_{p-1}),$

and, with the same rows and the columns $dw_{21},\dots,dw_{p1},dw_{32},\dots,dw_{p,p-1}$,

(35) $N = \operatorname{diag}(f_2,\dots,f_p,\ f_3,\dots,f_p,\ \dots,\ f_p).$

Then
(36) $|M - N| = \prod_{i<j}(f_i - f_j).$



The determinant of the linear transformation (23), (24) is

(37) $|E|^{-p}\,|E|^{p+1}\,|E|^{p+1}\,2^p\prod_{i<j}(f_i - f_j) = 2^p\,|E|^{p+2}\prod_{i<j}(f_i - f_j).$


Theorem 13.2.1. The Jacobian of the transformation (19) is the absolute 
value of (37). 
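Theorem 13.2.1 lends itself to a numerical sanity check. The sketch below is an illustration under stated assumptions, not part of the text's argument: it assumes NumPy, forms the Jacobian of (E, F) to (A, G) with A = E'FE and G = E'E by central differences (exact up to rounding here because each output is at most quadratic in any single input), and the helper names are hypothetical.

```python
import numpy as np

def forward(theta, p):
    """Map (vec E, f) to the functionally independent elements of A and G."""
    E = theta[:p * p].reshape(p, p)
    F = np.diag(theta[p * p:])
    A = E.T @ F @ E
    G = E.T @ E
    iu = np.triu_indices(p)              # upper triangle incl. diagonal
    return np.concatenate([A[iu], G[iu]])

def numeric_jacobian_det(E, f, h=1e-4):
    """Absolute determinant of the finite-difference Jacobian matrix."""
    p = len(f)
    theta = np.concatenate([E.ravel(), f])
    k = theta.size                       # p^2 + p inputs and outputs
    J = np.empty((k, k))
    for j in range(k):
        step = np.zeros(k)
        step[j] = h
        J[:, j] = (forward(theta + step, p) - forward(theta - step, p)) / (2 * h)
    return abs(np.linalg.det(J))

def theorem_value(E, f):
    """Absolute value of 2^p |E|^{p+2} prod_{i<j} (f_i - f_j)."""
    p = len(f)
    prod = 1.0
    for i in range(p):
        for j in range(i + 1, p):
            prod *= f[i] - f[j]
    return 2 ** p * abs(np.linalg.det(E)) ** (p + 2) * abs(prod)

rng = np.random.default_rng(1)
E = rng.standard_normal((3, 3)) + 2 * np.eye(3)
f = np.array([0.8, 0.5, 0.2])
print(numeric_jacobian_det(E, f), theorem_value(E, f))
```

For p = 1 the check can be done by hand: A = e²f and G = e² give a determinant of magnitude 2|e|³, which agrees with (37).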


13.2.3. The Joint Distribution of the Matrix E and the Roots 
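The joint density of A and B is
(38) $w(A \mid I, m)\,w(B \mid I, n) = C_1\,|A|^{\frac12(m-p-1)}\,|B|^{\frac12(n-p-1)}\,e^{-\frac12\operatorname{tr}(A+B)},$
where
(39) $C_1 = \left[2^{\frac12(m+n)p}\,\Gamma_p(\tfrac12 m)\,\Gamma_p(\tfrac12 n)\right]^{-1}.$
Therefore the joint density of E and F is
(40) $C_1\,|E'FE|^{\frac12(m-p-1)}\,|E'(I-F)E|^{\frac12(n-p-1)}\,e^{-\frac12\operatorname{tr}E'E}\cdot 2^p\,|E'E|^{\frac12(p+2)}\prod_{i<j}(f_i - f_j).$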




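Since $|E'FE| = |E'|\cdot|F|\cdot|E| = |F|\cdot|E'E| = \prod_{i=1}^p f_i\,|E'E|$ and
$|E'(I-F)E| = |I-F|\cdot|E'E| = \prod_{i=1}^p (1-f_i)\,|E'E|$, the density of E and F is

(41) $2^p C_1\,|E'E|^{\frac12(m+n-p)}\,e^{-\frac12\operatorname{tr}E'E}\prod_{i=1}^p f_i^{\frac12(m-p-1)}\prod_{i=1}^p (1-f_i)^{\frac12(n-p-1)}\prod_{i<j}(f_i - f_j).$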


Clearly, E and F are statistically independent because the density factors
into a function of E and a function of F. To determine the marginal
densities we have only to find the two normalizing constants (the product of
which is $2^pC_1$).

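Let us evaluate

(42) $2^p\int |E'E|^{\frac12(m+n-p)}\,e^{-\frac12\operatorname{tr}E'E}\,dE,$

where the integration is $0 < e_{i1} < \infty$, $-\infty < e_{ij} < \infty$, $j \ne 1$. The value of (42) is
unchanged if we let $-\infty < e_{i1} < \infty$ and multiply by $2^{-p}$. Thus (42) is

(43) $(2\pi)^{\frac12 p^2}\int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty} |E'E|^{\frac12(m+n-p)}\left[\dfrac{1}{(2\pi)^{\frac12 p^2}}\exp\left(-\tfrac12\sum_{i,j=1}^{p} e_{ij}^2\right)\right]\prod_{i,j=1}^{p} de_{ij}.$

Except for the constant $(2\pi)^{\frac12 p^2}$, (43) is a definition of the expectation of the
$\frac12(m+n-p)$th power of $|E'E|$ when the $e_{ij}$ have as density the function
within brackets. This expected value is the $\frac12(m+n-p)$th moment of the
generalized variance $|E'E|$ when $E'E$ has the distribution $W(I, p)$. (See
Section 7.5.) Thus (43) is

(44) $(2\pi)^{\frac12 p^2}\,2^{\frac12 p(n+m-p)}\,\dfrac{\Gamma_p[\frac12(m+n)]}{\Gamma_p(\frac12 p)}.$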

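Thus the density of E is

(45) $\dfrac{\Gamma_p(\frac12 p)}{2^{\frac12 p(m+n-2)}\,\pi^{\frac12 p^2}\,\Gamma_p[\frac12(m+n)]}\,|E'E|^{\frac12(m+n-p)}\,e^{-\frac12\operatorname{tr}E'E}.$

The density of $f_i$ is (41) divided by (45); that is, the density of $f_i$ is

(46) $C_2\prod_{i=1}^p f_i^{\frac12(m-p-1)}\prod_{i=1}^p (1-f_i)^{\frac12(n-p-1)}\prod_{i<j}(f_i - f_j)$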




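for $0 < f_p < \cdots < f_1 < 1$, where

(47) $C_2 = \dfrac{\pi^{\frac12 p^2}\,\Gamma_p[\frac12(m+n)]}{\Gamma_p(\frac12 n)\,\Gamma_p(\frac12 m)\,\Gamma_p(\frac12 p)}.$

The density of $l_i$ is obtained from (46) by letting

(48) $f_i = \dfrac{l_i}{l_i + 1};$

we have

(49) $\dfrac{df_i}{dl_i} = \dfrac{1}{(l_i+1)^2}, \qquad f_i - f_j = \dfrac{l_i - l_j}{(l_i+1)(l_j+1)}, \qquad 1 - f_i = \dfrac{1}{l_i+1}.$

Thus the density of $l_i$ is

(50) $C_2\prod_{i=1}^p l_i^{\frac12(m-p-1)}\prod_{i=1}^p (l_i+1)^{-\frac12(m+n)}\prod_{i<j}(l_i - l_j)$

for $0 < l_p < \cdots < l_1$.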


Theorem 13.2.2. If A and B are distributed independently according to
$W(\Sigma, m)$ and $W(\Sigma, n)$ respectively ($m \ge p$, $n \ge p$), the joint density of the roots
of $|A - lB| = 0$ is (50) where $C_2$ is defined by (47).
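The substitution (48) linking the roots of |A - f(A + B)| = 0 to the roots of |A - lB| = 0 can be illustrated numerically. This is a minimal sketch assuming NumPy; the function names `roots_l` and `roots_f` are hypothetical and the matrices are simulated only for illustration.

```python
import numpy as np

def roots_l(A, B):
    """Roots of |A - l B| = 0 in descending order."""
    return np.sort(np.linalg.eigvals(np.linalg.solve(B, A)).real)[::-1]

def roots_f(A, B):
    """Roots of |A - f (A + B)| = 0 in descending order."""
    return np.sort(np.linalg.eigvals(np.linalg.solve(A + B, A)).real)[::-1]

rng = np.random.default_rng(2)
p = 3
W1 = rng.standard_normal((p, 6))
W2 = rng.standard_normal((p, 7))
A, B = W1 @ W1.T, W2 @ W2.T
l, f = roots_l(A, B), roots_f(A, B)
print(np.allclose(f, l / (1 + l)))   # f_i = l_i / (l_i + 1)
```

Because l/(1 + l) is increasing in l, the descending order of the roots is preserved by the substitution.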


The joint density of Y can be found from (45) and the fact that the
Jacobian is $|Y|^{-2p}$. (See Theorem A.4.6 of the Appendix.)


13.2.4. The Distribution for A Singular 


The matrix A above can be represented as $A = W_1W_1'$, where the columns of
$W_1$ ($p \times m$) are independently distributed, each according to $N(0, \Sigma)$. We
now treat the case of $m < p$. If we let $B + W_1W_1' = G = CC'$ and $W_1 = CU$,
then the roots of

(51) $0 = |A - f(A+B)| = |W_1W_1' - fG| = |CUU'C' - fCC'| = |C|\cdot|UU' - fI_p|\cdot|C'|$

are the roots of
(52) $|UU' - fI_p| = 0.$

We shall show that the nonzero roots $f_1 > \cdots > f_m$ (these roots being distinct
with probability 1) are the roots of

(53) $|U'U - fI_m| = 0.$
For each root $f \ne 0$ of (52) there is a vector x satisfying
(54) $(UU' - fI_p)x = 0.$
Multiplication by $U'$ on the left gives
(55) $0 = U'(UU' - fI_p)x = (U'U - fI_m)(U'x).$
Thus $U'x$ is a characteristic vector of $U'U$ and f is the corresponding root.
It was shown in Section 8.4 that the density of $U = U_1$ is (for $I_p - UU'$
positive definite or $I_m - U'U$ positive definite)
(56) $K\,|I_p - UU'|^{\frac12(n-p-1)} = K\,|I_m - U'U|^{\frac12(n^*-p^*-1)},$

where $p^* = m$, $n^* - p^* - 1 = n - p - 1$, and $m^* = p$. Thus $f_1,\dots,f_m$ must be
distributed according to (46) with p replaced by m, m by p, and n by
$n + m - p$, that is,

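(57) $\dfrac{\pi^{\frac12 m^2}\,\Gamma_m[\frac12(m+n)]}{\Gamma_m(\frac12 m)\,\Gamma_m[\frac12(m+n-p)]\,\Gamma_m(\frac12 p)}\prod_{i=1}^m\left[f_i^{\frac12(p-m-1)}(1-f_i)^{\frac12(n-p-1)}\right]\prod_{i<j}(f_i - f_j).$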


Theorem 13.2.3. If A is distributed as $W_1W_1'$, where the m columns of $W_1$
are independent, each distributed according to $N(0, \Sigma)$, $m < p$, and B is
independently distributed according to $W(\Sigma, n)$, $n \ge p$, then the density of the nonzero
roots of $|A - f(A+B)| = 0$ is given by (57).
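The step from (52) to (53), that the nonzero roots of UU' coincide with the roots of U'U, is easy to confirm numerically. A minimal sketch, assuming NumPy; the helper name `roots_both_ways` and the simulated U are illustrative only.

```python
import numpy as np

def roots_both_ways(U):
    """Descending characteristic roots of UU' (p x p) and U'U (m x m)."""
    big = np.sort(np.linalg.eigvalsh(U @ U.T))[::-1]
    small = np.sort(np.linalg.eigvalsh(U.T @ U))[::-1]
    return big, small

rng = np.random.default_rng(4)
p, m = 5, 2                      # singular case m < p
U = rng.standard_normal((p, m))
big, small = roots_both_ways(U)
print(big[:m], small)            # the m nonzero roots agree
print(big[m:])                   # the remaining p - m roots vanish up to rounding
```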


These distributions of roots were found independently and at about the 
same time by Fisher (1939), Girshick (1939), Hsu (1939a), Mood (1951), and 
Roy (1939). The development of the Jacobian in Section 13.2.2 is due mainly 
to Hsu [as reported by Deemer and Olkin (1951)]. 


13.3. THE CASE OF ONE NONSINGULAR WISHART MATRIX 


In this section we shall find the distribution of the roots of
(1) $|A - lI| = 0,$

where the matrix A has the distribution $W(I, n)$. It will be observed that the
variances of the principal components of a sample of $n + 1$ from $N(\mu, I)$ are
$1/n$ times the roots of (1). We shall find the following theorem useful:

Theorem 13.3.1. If the symmetric matrix B has a density of the form
$g(l_1,\dots,l_p)$, where $l_1 > \cdots > l_p$ are the characteristic roots of B, then the density
of the roots is

(2) $\dfrac{\pi^{\frac12 p^2}\,g(l_1,\dots,l_p)\prod_{i<j}(l_i - l_j)}{\Gamma_p(\frac12 p)}.$


Proof. From Theorem A.2.1 of the Appendix we know that there exists an
orthogonal matrix C such that

(3) $B = C'LC,$
where

(4) $L = \begin{pmatrix} l_1 & 0 & \cdots & 0 \\ 0 & l_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & l_p \end{pmatrix}.$

If the l's are numbered in descending order of magnitude and if $c_{i1} \ge 0$, then
(with probability 1) the transformation from B to L and C is unique. Let the
matrix C be given the coordinates $c_1,\dots,c_{p(p-1)/2}$, and let the Jacobian of
the transformation be $f(L,C)$. Then the joint density of L and C is
$g(l_1,\dots,l_p)f(L,C)$. To prove the theorem we must show that

(5) $\int\!\cdots\!\int f(L,C)\,dC = \dfrac{\pi^{\frac12 p^2}\prod_{i<j}(l_i - l_j)}{\Gamma_p(\frac12 p)}.$

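We show this by taking a special case where $B = UU'$ and $U$ ($p \times m$, $m \ge p$)
has the density

(6) $\pi^{-\frac12 mp}\,\dfrac{\Gamma_p[\frac12(m+n)]}{\Gamma_p(\frac12 n)}\,|I - UU'|^{\frac12(n-p-1)}.$

Then by Lemma 13.3.1, which will be stated below, B has the density

(7) $\dfrac{\Gamma_p[\frac12(m+n)]}{\Gamma_p(\frac12 n)\,\Gamma_p(\frac12 m)}\,|I - B|^{\frac12(n-p-1)}\,|B|^{\frac12(m-p-1)} = \dfrac{\Gamma_p[\frac12(m+n)]}{\Gamma_p(\frac12 n)\,\Gamma_p(\frac12 m)}\prod_{i=1}^p (1-l_i)^{\frac12(n-p-1)}\prod_{i=1}^p l_i^{\frac12(m-p-1)} = g^*(l_1,\dots,l_p).$

The joint density of L and C is $f(L,C)\,g^*(l_1,\dots,l_p)$. In the preceding section
we proved that the marginal density of L is (50). Thus

(8) $\int\!\cdots\!\int g^*(l_1,\dots,l_p)\,f(L,C)\,dC = g^*(l_1,\dots,l_p)\int\!\cdots\!\int f(L,C)\,dC = \dfrac{\pi^{\frac12 p^2}\,\Gamma_p[\frac12(m+n)]}{\Gamma_p(\frac12 n)\,\Gamma_p(\frac12 m)\,\Gamma_p(\frac12 p)}\prod_{i=1}^p l_i^{\frac12(m-p-1)}\prod_{i=1}^p (1-l_i)^{\frac12(n-p-1)}\prod_{i<j}(l_i - l_j).$

This proves (5) and hence the theorem. ∎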
The statement above (7) is based on the following lemma:

Lemma 13.3.1. If the density of Y ($p \times m$) is $f(YY')$, then the density of
$B = YY'$ is

(9) $\dfrac{|B|^{\frac12(m-p-1)}\,f(B)\,\pi^{\frac12 pm}}{\Gamma_p(\frac12 m)}.$

The proof of this, like that of Theorem 13.3.1, depends on exhibiting a
special case; let $f(YY') = (2\pi)^{-\frac12 pm}e^{-\frac12\operatorname{tr}YY'}$; then (9) is $w(B \mid I, m)$.


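Now let us find the density of the roots of (1). The density of A is

(10) $\dfrac{|A|^{\frac12(n-p-1)}\,e^{-\frac12\operatorname{tr}A}}{2^{\frac12 pn}\,\Gamma_p(\frac12 n)} = \dfrac{\prod_{i=1}^p l_i^{\frac12(n-p-1)}\exp\left(-\frac12\sum_{i=1}^p l_i\right)}{2^{\frac12 pn}\,\Gamma_p(\frac12 n)}.$

Thus by the theorem we obtain as the density of the roots of A

(11) $\dfrac{\pi^{\frac12 p^2}\prod_{i=1}^p l_i^{\frac12(n-p-1)}\exp\left(-\frac12\sum_{i=1}^p l_i\right)\prod_{i<j}(l_i - l_j)}{2^{\frac12 pn}\,\Gamma_p(\frac12 n)\,\Gamma_p(\frac12 p)}.$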




Theorem 13.3.2. If A ($p \times p$) has the distribution $W(I, n)$, then the
characteristic roots ($l_1 \ge l_2 \ge \cdots \ge l_p \ge 0$) have the density (11) over the range where
the density is not 0.

Corollary 13.3.1. Let $v_1 \ge \cdots \ge v_p$ be the sample variances of the sample
principal components of a sample of size $N = n + 1$ from $N(\mu, \sigma^2 I)$. Then
$(n/\sigma^2)v_i$ are distributed with density (11).
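As an illustration of the density of the roots of a $W(I, n)$ matrix, the following sketch evaluates its logarithm using only the Python standard library. It assumes the multivariate gamma function $\Gamma_p(t) = \pi^{p(p-1)/4}\prod_{i=1}^{p}\Gamma[t - \frac12(i-1)]$ and $n \ge p$; the function names are hypothetical. For $p = 1$ the expression should reduce to the chi-square density with n degrees of freedom, which gives a simple check.

```python
import math

def log_multigamma(t, p):
    """log Gamma_p(t) = p(p-1)/4 log(pi) + sum_{i=0}^{p-1} log Gamma(t - i/2)."""
    return 0.25 * p * (p - 1) * math.log(math.pi) + sum(
        math.lgamma(t - 0.5 * i) for i in range(p)
    )

def log_density_roots(l, n):
    """Log density of ordered roots l_1 > ... > l_p > 0 of a W(I, n) matrix."""
    p = len(l)
    if l[-1] <= 0 or any(l[i] <= l[i + 1] for i in range(p - 1)):
        return -math.inf                      # outside the support
    out = 0.5 * p * p * math.log(math.pi) - 0.5 * p * n * math.log(2.0)
    out -= log_multigamma(0.5 * n, p) + log_multigamma(0.5 * p, p)
    out += sum(0.5 * (n - p - 1) * math.log(x) - 0.5 * x for x in l)
    out += sum(math.log(l[i] - l[j]) for i in range(p) for j in range(i + 1, p))
    return out

# p = 1: compare with the chi-square(n) log density at x
x, n = 3.0, 4
chi2_log = (0.5 * n - 1) * math.log(x) - 0.5 * x - 0.5 * n * math.log(2.0) - math.lgamma(0.5 * n)
print(log_density_roots([x], n), chi2_log)
```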


The characteristic vectors of A are uniquely defined (except for
multiplication by $-1$) with probability 1 by

(12) $(A - lI)y = 0, \qquad y'y = 1,$

since the roots are different with probability 1. Let the vectors with $y_{1i} \ge 0$ be
(13) $Y = (y_1,\dots,y_p).$

Then

(14) $AY = YL.$

From Section 11.2 we know that

(15) $Y'Y = I.$
Multiplication of (14) on the right by $Y^{-1} = Y'$ gives
(16) $A = YLY'.$

Thus $Y' = C$, defined above.
Now let us consider the joint distribution of L and C. The matrix A has
the distribution of

(17) $A = \sum_{\alpha=1}^{n} X_\alpha X_\alpha',$
where the $X_\alpha$ are independently distributed, each according to $N(0, I)$. Let

(18) $X_\alpha^* = QX_\alpha,$

where Q is any orthogonal matrix. Then the $X_\alpha^*$ are independently
distributed according to $N(0, I)$ and

(19) $A^* = \sum_{\alpha=1}^{n} X_\alpha^* X_\alpha^{*\prime} = QAQ'$

is distributed according to $W(I, n)$. The roots of $A^*$ are the roots of A; thus
(20) $A^* = C^{**\prime}LC^{**},$
(21) $C^{**\prime}C^{**} = I$

define $C^{**}$ if we require $c_{i1}^{**} \ge 0$. Let

(22) $C^* = CQ'.$
Let

(23) $J(C^*) = \operatorname{diag}\left(\dfrac{c_{11}^*}{|c_{11}^*|},\ \dfrac{c_{21}^*}{|c_{21}^*|},\ \dots,\ \dfrac{c_{p1}^*}{|c_{p1}^*|}\right),$

with $c_{i1}^*/|c_{i1}^*| = 1$ if $c_{i1}^* = 0$. Thus $J(C^*)$ is a diagonal matrix; the ith
diagonal element is 1 if $c_{i1}^* \ge 0$ and is $-1$ if $c_{i1}^* < 0$. Thus

(24) $C^{**} = J(C^*)C^* = J(CQ')CQ'.$

The distribution of C** is the same as that of C. We now shall show that 
this fact defines the distribution of C. 


Definition 13.3.1. If the random orthogonal matrix E of order p has a 
distribution such that EQ' has the same distribution for every orthogonal Q, then 
E is said to have the Haar invariant distribution (or normalized measure). 


The definition is possible because it has been proved that there is only one
distribution with the required invariance property [Halmos (1950)]. It has also
been shown that this distribution is the only one invariant under multiplication
on the left by an orthogonal matrix (i.e., the distribution of QE is the
same as that of E). From this it follows that the probability is $1/2^p$ that E is
such that $e_{i1} \ge 0$. This can be seen as follows. Let $J_1,\dots,J_{2^p}$ be the $2^p$
diagonal matrices with elements $+1$ and $-1$. Since the distribution of $J_iE$ is
the same as that of E, the probability that $e_{i1} \ge 0$ is the same as the
probability that the elements in the first column of $J_iE$ are nonnegative.
These events for $i = 1,\dots,2^p$ are mutually exclusive and exhaustive (except
for elements being 0, which have probability 0), and thus the probability of
any one is $1/2^p$.

The conditional distribution of E given $e_{i1} \ge 0$ is $2^p$ times the Haar
invariant distribution over this part of the space. We shall call it the
conditional Haar invariant distribution.
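A draw from the conditional Haar invariant distribution can be sketched as follows. The construction is an assumption brought in for illustration and is not taken from the text: a Haar-distributed orthogonal matrix is obtained from the QR factorization of a Gaussian matrix with the usual sign correction on the diagonal of R, and the rows are then reflected so that the first column is nonnegative, which mirrors the map E to J(E)E. The function names are hypothetical; NumPy is assumed.

```python
import numpy as np

def haar_orthogonal(p, rng):
    """Haar-distributed orthogonal matrix via QR of a Gaussian matrix."""
    Z = rng.standard_normal((p, p))
    Q, R = np.linalg.qr(Z)
    return Q * np.sign(np.diag(R))       # rescale column j by sign(R_jj)

def conditional_haar(p, rng):
    """Reflect rows so that every first-column element is nonnegative."""
    E = haar_orthogonal(p, rng)
    signs = np.where(E[:, 0] < 0, -1.0, 1.0)
    return signs[:, None] * E            # J(E) E

rng = np.random.default_rng(5)
E = conditional_haar(4, rng)
print(np.allclose(E @ E.T, np.eye(4)), np.all(E[:, 0] >= 0))
```

Folding each of the $2^p$ sign patterns onto the region $e_{i1} \ge 0$ is what produces the factor $2^p$ in the conditional distribution.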


Lemma 13.3.2. If the orthogonal matrix E has a distribution such that
$e_{i1} \ge 0$ and if $E^{**} = J(EQ')EQ'$ has the same distribution for every orthogonal
Q, then E has the conditional Haar invariant distribution.


Proof. Let the space V of orthogonal matrices be partitioned into the
subspaces $V_1,\dots,V_{2^p}$ so that $J_iV_i = V_1$, say, where $J_1 = I$ and $V_1$ is the set for
which $e_{i1} \ge 0$. Let $\mu_1$ be the measure in $V_1$ defined by the distribution of E
assumed in the lemma. The measure $\mu(W)$ of a (measurable) set W in $V_i$ is
defined as $(1/2^p)\mu_1(J_iW)$. Now we want to show that $\mu$ is the Haar
invariant measure. Let W be any (measurable) set in $V_1$. The lemma assumes
that $2^p\mu(W) = \mu_1(W) = \Pr\{E \in W\} = \Pr\{E^{**} \in W\} = \sum_i \mu_1(J_i[WQ' \cap V_i]) =
2^p\mu(WQ')$. If U is any (measurable) set in V, then $U = \bigcup_j (U \cap V_j)$. Since
$\mu(U \cap V_j) = (1/2^p)\mu_1[J_j(U \cap V_j)]$, by the above this is $\mu[(U \cap V_j)Q']$. Thus
$\mu(U) = \mu(UQ')$. Thus $\mu$ is invariant and $\mu_1$ is the conditional invariant
distribution. ∎


From the lemma we see that the matrix C has the conditional Haar 
invariant distribution. Since the distribution of C conditional on L is the 
same, C and L are independent. 


Theorem 13.3.3. If $C = Y'$, where $Y = (y_1,\dots,y_p)$ are the normalized
characteristic vectors of A with $y_{1i} \ge 0$ and where A is distributed according to
$W(I, n)$, then C has the conditional Haar invariant distribution and C is
distributed independently of the characteristic roots.


From the preceding work we can generalize Theorem 13.3.1. 


Theorem 13.3.4. If the symmetric matrix B has a density of the form
$g(l_1,\dots,l_p)$, where $l_1 > \cdots > l_p$ are the characteristic roots of B, then the joint
density of the roots is (2) and the matrix of normalized characteristic vectors Y
($y_{1i} \ge 0$) is independently distributed according to the conditional Haar invariant
distribution.


Proof. The density of $QBQ'$, where $QQ' = I$, is the same as that of B (for
the roots are invariant), and therefore the distribution of $J(Y'Q')Y'Q'$ is the
same as that of $Y'$. Then Theorem 13.3.4 follows from Lemma 13.3.2. ∎


We shall give an application of this theorem to the case where $B = B'$ is
normally distributed with the (functionally independent) components of B
independent with means 0 and variances $\mathcal{E}b_{ii}^2 = 1$ and $\mathcal{E}b_{ij}^2 = \frac12$ ($i < j$).


Theorem 13.3.5. Let $B = B'$ have the density

(25) $2^{-\frac12 p}\,\pi^{-\frac14 p(p+1)}\,e^{-\frac12\operatorname{tr}B^2}.$

Then the characteristic roots $l_1 > \cdots > l_p$ of B have the density

(26) $2^{-\frac12 p}\,\pi^{\frac14 p(p-1)}\,\Gamma_p^{-1}(\tfrac12 p)\exp\left(-\tfrac12\sum_{i=1}^p l_i^2\right)\prod_{i<j}(l_i - l_j),$

and the matrix Y of the normalized characteristic vectors ($y_{1i} \ge 0$) is
independently distributed according to the conditional Haar invariant distribution.


Proof. Since the characteristic roots of $B^2$ are $l_1^2,\dots,l_p^2$ and $\operatorname{tr}B^2 = \sum_{i=1}^p l_i^2$,
the theorem follows directly. ∎


Corollary 13.3.2. Let nS be distributed according to $W(I, n)$, and define the
diagonal matrix L and the orthogonal matrix C by $S = C'LC$, $C'C = I$, $l_1 > \cdots > l_p$, and $c_{i1} \ge 0$,
$i = 1,\dots,p$. Then the density of the limiting distribution of $\sqrt{n}(L - I) = D$
diagonal is (26) with $l_i$ replaced by $d_i$, and the matrix C is independently
distributed according to the conditional Haar measure.


Proof. The density of the limiting distribution of $\sqrt{n}(S - I)$ is (25), and the
diagonal elements of D are the characteristic roots of $\sqrt{n}(S - I)$ and the
columns of $C'$ are the characteristic vectors. ∎


13.4. CANONICAL CORRELATIONS 


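The sample canonical correlations were shown in Section 12.3 to be the
square roots of the roots of

(1) $|A_{12}A_{22}^{-1}A_{21} - fA_{11}| = 0,$
where
(2) $A_{ij} = \sum_{\alpha=1}^{N}\left(X_\alpha^{(i)} - \bar X^{(i)}\right)\left(X_\alpha^{(j)} - \bar X^{(j)}\right)', \qquad i,j = 1,2,$

and the distribution of

(3) $X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}$

is $N(\mu, \Sigma)$, where

(4) $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$

From Section 3.3 we know that the distribution of $A_{ij}$ is the same as that of
(5) $A_{ij} = \sum_{\alpha=1}^{n} Y_\alpha^{(i)}Y_\alpha^{(j)\prime}, \qquad i,j = 1,2,$

where $n = N - 1$ and

(6) $Y = \begin{pmatrix} Y^{(1)} \\ Y^{(2)} \end{pmatrix}$

is distributed according to $N(0, \Sigma)$. Let us assume that the dimensionality of
$Y^{(1)}$, say $p_1$, is not greater than the dimensionality of $Y^{(2)}$, say $p_2$. Then there
are $p_1$ nonzero roots of (1), say

(7) $f_1 > f_2 > \cdots > f_{p_1}.$
Now we shall find the distribution of $\{f_i\}$ when
(8) $\Sigma_{12} = 0.$
For the moment assume $\{Y_\alpha^{(2)}\}$ to be fixed. Then $A_{22}$ is fixed, and
(9) $B = A_{12}A_{22}^{-1}$

is the matrix of regression coefficients of $Y^{(1)}$ on $Y^{(2)}$. From Section 4.3 we
know that

(10) $A_{11\cdot2} = \sum_{\alpha=1}^{n}\left(Y_\alpha^{(1)} - BY_\alpha^{(2)}\right)\left(Y_\alpha^{(1)} - BY_\alpha^{(2)}\right)' = A_{11} - BA_{22}B' = A_{11} - A_{12}A_{22}^{-1}A_{21}$

and

(11) $Q = BA_{22}B' = A_{12}A_{22}^{-1}A_{21}$

($\beta = 0$) are independently distributed according to $W(\Sigma_{11}, n - p_2)$ and
$W(\Sigma_{11}, p_2)$, respectively. In terms of Q the equation (1) defining f is

(12) $|Q - f(A_{11\cdot2} + Q)| = 0.$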




The distribution of $f_i$, $i = 1,\dots,p_1$, is the distribution of the nonzero roots of
(12), and the density is given by (see Section 13.2)

(13) $\dfrac{\pi^{\frac12 p_1^2}\,\Gamma_{p_1}(\frac12 n)}{\Gamma_{p_1}[\frac12(n - p_2)]\,\Gamma_{p_1}(\frac12 p_1)\,\Gamma_{p_1}(\frac12 p_2)}\prod_{i=1}^{p_1}\left[f_i^{\frac12(p_2 - p_1 - 1)}(1 - f_i)^{\frac12(n - p_2 - p_1 - 1)}\right]\prod_{i<j}(f_i - f_j).$

Since the conditional density (13) does not depend upon $Y^{(2)}$, (13) is the
unconditional density of the squares of the sample canonical correlation
coefficients of the two sets $X_\alpha^{(1)}$ and $X_\alpha^{(2)}$, $\alpha = 1,\dots,N$. The density (13) also
holds when the $X_\alpha^{(2)}$ are actually fixed variate vectors or have any distribution,
so long as $X^{(1)}$ and $X^{(2)}$ are independently distributed and $X^{(1)}$ has a
multivariate normal distribution.

In the special case when $p_1 = 1$, $p_2 = p - 1$, (13) reduces to

$\dfrac{\Gamma[\frac12(N-1)]}{\Gamma[\frac12(N-p)]\,\Gamma[\frac12(p-1)]}\,f^{\frac12(p-1)-1}(1-f)^{\frac12(N-p)-1},$

which is the density of the square of the sample multiple correlation
coefficient between $X^{(1)}$ ($p_1 = 1$) and $X^{(2)}$ ($p_2 = p - 1$).
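The roots in question can be computed directly from data. The sketch below is illustrative only: it assumes NumPy, the function name `squared_canonical_correlations` is hypothetical, and the simulated samples merely exercise the code. With $p_1 = 1$ the single root should equal the squared sample multiple correlation coefficient, which the example compares against a least-squares fit.

```python
import numpy as np

def squared_canonical_correlations(X1, X2):
    """Descending roots f of |A12 A22^{-1} A21 - f A11| = 0.

    X1 is N x p1 and X2 is N x p2, with rows as observations.
    """
    D1 = X1 - X1.mean(axis=0)
    D2 = X2 - X2.mean(axis=0)
    A11, A12, A22 = D1.T @ D1, D1.T @ D2, D2.T @ D2
    M = np.linalg.solve(A11, A12 @ np.linalg.solve(A22, A12.T))
    return np.sort(np.linalg.eigvals(M).real)[::-1]

rng = np.random.default_rng(3)
N = 50
X2 = rng.standard_normal((N, 3))
X1 = rng.standard_normal((N, 1))          # p1 = 1: multiple correlation case
f = squared_canonical_correlations(X1, X2)

# squared multiple correlation from a least-squares fit on centered data
D1 = X1 - X1.mean(axis=0)
D2 = X2 - X2.mean(axis=0)
coef, *_ = np.linalg.lstsq(D2, D1, rcond=None)
resid = D1 - D2 @ coef
r2 = 1 - (resid ** 2).sum() / (D1 ** 2).sum()
print(f[0], r2)
```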


13.5. ASYMPTOTIC DISTRIBUTIONS IN THE CASE OF 
ONE WISHART MATRIX 


13.5.1. All Population Roots Different 


In Section 13.3 we found the density of the diagonal matrix L and the
orthogonal matrix B defined by $S = BLB'$, $l_1 \ge \cdots \ge l_p$, and $b_{1i} \ge 0$, $i = 1,\dots,p$,
when nS is distributed according to $W(I, n)$. In this section we find
the asymptotic distribution of L and B when nS is distributed according to
$W(\Sigma, n)$ and the characteristic roots of $\Sigma$ are different. (Corollary 13.3.2
gave the asymptotic distribution when $\Sigma = I$.)


Theorem 13.5.1. Suppose nS has the distribution $W(\Sigma, n)$. Define diagonal
$\Lambda$ and L and orthogonal $\beta$ and B by

(1) $\Sigma = \beta\Lambda\beta', \qquad S = BLB',$

$\lambda_1 > \lambda_2 > \cdots > \lambda_p$, $l_1 \ge l_2 \ge \cdots \ge l_p$, $\beta_{ii} \ge 0$, $b_{ii} \ge 0$, $i = 1,\dots,p$. Define
$G = \sqrt{n}(B - \beta)$ and diagonal $D = \sqrt{n}(L - \Lambda)$. Then the limiting distribution of
D and G is normal with D and G independent, and the diagonal elements of D
are independent. The diagonal element $d_i$ has the limiting distribution $N(0, 2\lambda_i^2)$.
The covariance matrix of $g_i$ in the limiting distribution of $G = (g_1,\dots,g_p)$ is

(2) $\mathcal{AC}(g_i) = \sum_{k=1,\,k\ne i}^{p}\dfrac{\lambda_i\lambda_k}{(\lambda_i - \lambda_k)^2}\,\beta_k\beta_k',$

where $\beta = (\beta_1,\dots,\beta_p)$. The covariance matrix of $g_i$ and $g_j$ in the limiting
distribution is
(3) $\mathcal{AC}(g_i, g_j) = -\dfrac{\lambda_i\lambda_j}{(\lambda_i - \lambda_j)^2}\,\beta_j\beta_i', \qquad i \ne j.$

Proof. The matrix $nT = n\beta'S\beta$ is distributed according to $W(\Lambda, n)$. Let
(4) $T = YLY',$

where Y is orthogonal. In order that (4) determine Y uniquely, we require
$y_{ii} \ge 0$. Let $\sqrt{n}(T - \Lambda) = U$ and $\sqrt{n}(Y - I) = W$. Then (4) can be written

(5) $\Lambda + \dfrac{1}{\sqrt n}U = \left(I + \dfrac{1}{\sqrt n}W\right)\left(\Lambda + \dfrac{1}{\sqrt n}D\right)\left(I + \dfrac{1}{\sqrt n}W\right)',$

which is equivalent to

(6) $U = W\Lambda + D + \Lambda W' + \dfrac{1}{\sqrt n}(WD + W\Lambda W' + DW') + \dfrac{1}{n}WDW'.$

From $I = YY' = [I + (1/\sqrt n)W][I + (1/\sqrt n)W']$, we have
(7) $0 = W + W' + \dfrac{1}{\sqrt n}WW'.$

We shall proceed heuristically and justify the method later. If we neglect
terms of order $1/\sqrt n$ and $1/n$ in (6) and (7), we obtain

(8) $U = W\Lambda + D + \Lambda W',$

(9) $0 = W + W'.$

When we substitute $W' = -W$ from (9) into (8) and write the result in
components, we obtain $w_{ii} = 0$,

(10) $d_i = u_{ii}, \qquad i = 1,\dots,p,$

(11) $w_{ij} = \dfrac{u_{ij}}{\lambda_j - \lambda_i}, \qquad i \ne j,\ i,j = 1,\dots,p.$

(Note $w_{ij} = -w_{ji}$.) From Theorem 3.4.4 we know that in the limiting normal
distribution of U the functionally independent elements are statistically
independent with means 0 and variances $\mathcal{AV}(u_{ii}) = 2\lambda_i^2$ and $\mathcal{AV}(u_{ij}) =
\lambda_i\lambda_j$, $i \ne j$. Then the limiting distribution of D and W is normal, and
$d_1,\dots,d_p, w_{12}, w_{13},\dots,w_{p-1,p}$ are independent with means 0 and variances
$\mathcal{AV}(d_i) = 2\lambda_i^2$, $i = 1,\dots,p$, and $\mathcal{AV}(w_{ij}) = \lambda_i\lambda_j/(\lambda_j - \lambda_i)^2$, $j = i+1,\dots,p$,
$i = 1,\dots,p-1$. Each column of B is $\pm$ the corresponding column of $\beta Y$;
since $Y \to_p I$, we have $\beta Y \to_p \beta$, and with arbitrarily high probability each
column of B is nearly identical to the corresponding column of $\beta Y$. Then
$G = \sqrt{n}(B - \beta)$ has the limiting distribution of $\beta\sqrt{n}(Y - I) = \beta W$. The
asymptotic variances and covariances follow.

Now we justify the limiting distribution of D and W. The equations
$T = YLY'$ and $I = YY'$ and conditions $l_1 > \cdots > l_p$, $y_{ii} > 0$, $i = 1,\dots,p$, define
a 1-1 transformation of T to Y, L except for a set of measure 0. The
transformation from Y, L to T is continuously differentiable. The inverse is
continuously differentiable in a neighborhood of $Y = I$ and $L = \Lambda$, since the
equations (8) and (9) can be solved uniquely. Hence Y, L as a function of T
satisfies the conditions of Theorem 4.2.3. ∎
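The heuristic linearization in the proof (the root perturbation $d_i$ is approximately $u_{ii}$, and the vector perturbation $w_{ij}$ is approximately $u_{ij}/(\lambda_j - \lambda_i)$) can be illustrated with a fixed symmetric perturbation. This is a sketch under stated assumptions: NumPy is assumed, the function name `linearization_check` is hypothetical, and U is a fixed matrix chosen for illustration rather than a random draw.

```python
import numpy as np

def linearization_check(lam, U, n):
    """Return D = sqrt(n)(l - lam) and W = sqrt(n)(Y - I) for T = Lam + U/sqrt(n)."""
    p = len(lam)
    T = np.diag(lam) + U / np.sqrt(n)
    vals, vecs = np.linalg.eigh(T)
    order = np.argsort(vals)[::-1]               # descending roots
    l, Y = vals[order], vecs[:, order]
    Y = Y * np.where(np.diag(Y) < 0, -1.0, 1.0)  # sign convention y_ii >= 0
    return np.sqrt(n) * (l - lam), np.sqrt(n) * (Y - np.eye(p))

lam = np.array([3.0, 2.0, 1.0])
U = np.array([[0.5, 0.3, -0.2],
              [0.3, -0.4, 0.1],
              [-0.2, 0.1, 0.2]])
D, W = linearization_check(lam, U, n=1e6)

# first-order predictions: d_i = u_ii, w_ij = u_ij / (lam_j - lam_i), w_ii = 0
W_pred = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            W_pred[i, j] = U[i, j] / (lam[j] - lam[i])
print(np.round(D, 3), np.round(W, 3))
```

The neglected terms are of order $1/\sqrt n$, so with $n = 10^6$ the agreement should be within roughly $10^{-3}$ for perturbations of this size.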


13.5.2. One Root of Higher Multiplicity 


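In Section 11.7.3 we used the asymptotic distribution of the q smallest
sample roots when the q smallest population roots are equal. We shall now
derive that distribution. Let

(12) $\Lambda = \begin{pmatrix} \Lambda_1 & 0 \\ 0 & \lambda^*I_q \end{pmatrix},$

where the diagonal elements of the diagonal matrix $\Lambda_1$ are different and are
larger than $\lambda^*$ ($> 0$). Let

(13) $T = \begin{pmatrix} T_{11} & T_{12} \\ T_{21} & T_{22} \end{pmatrix}, \qquad Y = \begin{pmatrix} Y_{11} & Y_{12} \\ Y_{21} & Y_{22} \end{pmatrix}, \qquad L = \begin{pmatrix} L_1 & 0 \\ 0 & L_2 \end{pmatrix}.$

Then $T \to_p \Lambda$, which implies $L \to_p \Lambda$, $Y_{11} \to_p I$, $Y_{12} \to_p 0$, $Y_{21} \to_p 0$, but $Y_{22}$ does
not have a probability limit. However, $Y_{22}Y_{22}' \to_p I_q$. Let the singular value
decomposition of $Y_{22}$ be $EJF$, where J is diagonal and E and F are
orthogonal. Define $C_2 = EF$, which is orthogonal. Let $U = \sqrt{n}(T - \Lambda)$ and
$D = \sqrt{n}(L - \Lambda)$ be partitioned similarly to T and L. Define $W_{11} =
\sqrt{n}(Y_{11} - I)$, $W_{12} = \sqrt{n}\,Y_{12}$, $W_{21} = \sqrt{n}\,Y_{21}$, and $W_{22} = \sqrt{n}(Y_{22} - C_2) = \sqrt{n}\,E(J - I_q)F$.
Then (4) can be written

(14) $\begin{pmatrix} \Lambda_1 & 0 \\ 0 & \lambda^*I_q \end{pmatrix} + \dfrac{1}{\sqrt n}U$
$= \left[\begin{pmatrix} I & 0 \\ 0 & C_2 \end{pmatrix} + \dfrac{1}{\sqrt n}\begin{pmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{pmatrix}\right]\left[\begin{pmatrix} \Lambda_1 & 0 \\ 0 & \lambda^*I_q \end{pmatrix} + \dfrac{1}{\sqrt n}\begin{pmatrix} D_1 & 0 \\ 0 & D_2 \end{pmatrix}\right]\left[\begin{pmatrix} I & 0 \\ 0 & C_2' \end{pmatrix} + \dfrac{1}{\sqrt n}\begin{pmatrix} W_{11}' & W_{21}' \\ W_{12}' & W_{22}' \end{pmatrix}\right]$
$= \begin{pmatrix} \Lambda_1 & 0 \\ 0 & \lambda^*I_q \end{pmatrix} + \dfrac{1}{\sqrt n}\left[\begin{pmatrix} D_1 & 0 \\ 0 & C_2D_2C_2' \end{pmatrix} + \begin{pmatrix} \Lambda_1W_{11}' & \Lambda_1W_{21}' \\ \lambda^*C_2W_{12}' & \lambda^*C_2W_{22}' \end{pmatrix} + \begin{pmatrix} W_{11}\Lambda_1 & \lambda^*W_{12}C_2' \\ W_{21}\Lambda_1 & \lambda^*W_{22}C_2' \end{pmatrix}\right] + \dfrac{1}{n}M,$

where the submatrices of M are sums of products of $C_2$, $\Lambda_1$, $\lambda^*I_q$, $D_k$, $W_{kl}$,
and $1/\sqrt n$. The orthogonality of Y ($I_p = YY'$) implies

(15) $I_p = \begin{pmatrix} I_{p-q} & 0 \\ 0 & I_q \end{pmatrix} + \dfrac{1}{\sqrt n}\left[\begin{pmatrix} W_{11} & W_{12}C_2' \\ W_{21} & W_{22}C_2' \end{pmatrix} + \begin{pmatrix} W_{11}' & W_{21}' \\ C_2W_{12}' & C_2W_{22}' \end{pmatrix}\right] + \dfrac{1}{n}N,$

where the submatrices of N are sums of products of $W_{kl}$. From (14) and (15)
we find that

(16) $U_{22} = C_2D_2C_2' + O_p(1/\sqrt n).$

The limiting distribution of $(1/\lambda^*)U_{22}$ has the density (25) of Section 13.3
with p replaced by q. Then the limiting distribution of $D_2$ and $C_2$ is the
distribution of $D_2^*$ and $Y_{22}^*$ defined by $U_{22}^* = Y_{22}^*D_2^*Y_{22}^{*\prime}$, where $(1/\lambda^*)U_{22}^*$ has
the density (25) of Section 13.3.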




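Theorem 13.5.2. Under the conditions of Theorem 13.5.1 and $\Lambda =
\operatorname{diag}(\Lambda_1, \lambda^*I_q)$, the density of the limiting distribution of $d_{p-q+1},\dots,d_p$ is

(17) $2^{-\frac12 q}\,(\lambda^*)^{-\frac12 q(q+1)}\,\pi^{\frac14 q(q-1)}\,\Gamma_q^{-1}(\tfrac12 q)\exp\left(-\dfrac{1}{2\lambda^{*2}}\sum_{i=p-q+1}^{p} d_i^2\right)\prod_{i<j}(d_i - d_j).$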


To justify the preceding derivation we note that $D_2$ and $Y_{22}$ are functions
of U depending on n that converge to the solution of $U_{22}^* = Y_{22}^*D_2^*Y_{22}^{*\prime}$. We
can use the following theorem given by Anderson (1963a) and due to Rubin.


Theorem 13.5.3. Let $F_n(u)$ be the cumulative distribution function of a
random matrix $U_n$. Let $V_n$ be a matrix-valued function of $U_n$, $V_n = f_n(U_n)$, and
let $G_n(v)$ be the (induced) distribution of $V_n$. Suppose $F_n(u) \to F(u)$ in every
continuity point of $F(u)$, and suppose for every continuity point u of $f(u)$,
$f_n(u_n) \to f(u)$ when $u_n \to u$. Let $G(v)$ be the distribution of the random matrix
$V = f(U)$, where U has the distribution $F(u)$. If the probability of the set of
discontinuities of $f(u)$ according to $F(u)$ is 0, then

(18) $\lim_{n\to\infty} G_n(v) = G(v)$
in every continuity point of $G(v)$.
The details of verifying that $U(n)$ and

(19) $(D_2(n), Y_{22}(n)) = f_n[U(n)]$

satisfy the conditions of the theorem have been given by Anderson (1963a).


13.6. ASYMPTOTIC DISTRIBUTIONS IN THE CASE OF 
TWO WISHART MATRICES 


13.6.1. All Population Roots Different 

In Section 13.2 we studied the distributions of the roots $l_1 \ge l_2 \ge \cdots \ge l_p$ of
(1) $|S^* - lT^*| = 0$

and the vectors satisfying

(2) $(S^* - lT^*)x^* = 0$

and $x^{*\prime}T^*x^* = 1$ when $A^* = mS^*$ and $B^* = nT^*$ are distributed independently
according to $W(\Sigma, m)$ and $W(\Sigma, n)$, respectively. In this section we
study the asymptotic distributions of the roots and vectors as $n \to \infty$ when $A^*$
and $B^*$ are distributed independently according to $W(\Phi, m)$ and $W(\Sigma, n)$,
respectively, and $m/n \to \eta > 0$. We shall assume that the roots of

(3) $|\Phi - \lambda\Sigma| = 0$

are distinct. (In Section 13.2 $\lambda_1 = \cdots = \lambda_p = 1$.)

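Theorem 13.6.1. Let $mS^*$ and $nT^*$ be independently distributed according
to $W(\Phi, m)$ and $W(\Sigma, n)$, respectively. Let $\lambda_1 > \lambda_2 > \cdots > \lambda_p$ ($> 0$) be the
roots of (3), and let $\Lambda$ be the diagonal matrix with the roots as diagonal elements
in descending order; let $\gamma_1,\dots,\gamma_p$ be the solutions to

(4) $(\Phi - \lambda_i\Sigma)\gamma = 0, \qquad i = 1,\dots,p,$

$\gamma'\Sigma\gamma = 1$, and $\gamma_{ii} \ge 0$, and let $\Gamma = (\gamma_1,\dots,\gamma_p)$. Let $l_1 \ge \cdots \ge l_p$ ($> 0$) be the
roots of (1), and let L be the diagonal matrix with the roots as diagonal elements
in descending order; let $x_1^*,\dots,x_p^*$ be the solutions to (2) for $l = l_i$, $i = 1,\dots,p$,
$x^{*\prime}T^*x^* = 1$, and $x_{ii}^* \ge 0$, and let $X^* = (x_1^*,\dots,x_p^*)$. Define $Z^* = \sqrt{n}(X^* - \Gamma)$
and diagonal $D = \sqrt{n}(L - \Lambda)$. Then the limiting distribution of D and $Z^*$ is
normal with means 0 as $n \to \infty$, $m \to \infty$, and $m/n \to \eta$ ($> 0$). The asymptotic
variances and covariances that are not 0 are

(5) $\mathcal{AV}(d_i) = \dfrac{2\lambda_i^2(1+\eta)}{\eta},$
(6) $\mathcal{AC}(z_i^*) = \sum_{k=1,\,k\ne i}^{p}\dfrac{\lambda_i(\lambda_k + \eta\lambda_i)}{\eta(\lambda_k - \lambda_i)^2}\,\gamma_k\gamma_k' + \tfrac12\gamma_i\gamma_i',$
(7) $\mathcal{AC}(d_i, z_i^*) = \lambda_i\gamma_i,$
(8) $\mathcal{AC}(z_i^*, z_j^*) = -\dfrac{\lambda_i\lambda_j(1+\eta)}{\eta(\lambda_i - \lambda_j)^2}\,\gamma_j\gamma_i', \qquad i \ne j.$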
Proof. Let
(9) $S = \Gamma'S^*\Gamma, \qquad T = \Gamma'T^*\Gamma.$

Then mS and nT are distributed independently according to $W(\Lambda, m)$ and
$W(I, n)$, respectively (Section 7.3.3). Then $l_1,\dots,l_p$ are the roots of

(10) $|S - lT| = 0.$

Let $x_1,\dots,x_p$ be the solutions to
(11) $(S - l_iT)x = 0, \qquad i = 1,\dots,p,$

and $x'Tx = 1$, and let $X = (x_1,\dots,x_p)$. Then $x_i^* = \Gamma x_i$ and $X^* = \Gamma X$ except
possibly for multiplication of columns of X (or $X^*$) by $-1$. If $Z = \sqrt{n}(X - I)$,
then $Z^* = \Gamma Z$ (except possibly for multiplication of columns by $-1$).

We shall now find the limiting distribution of D and Z. Let $\sqrt{n}(S - \Lambda) = U$
and $\sqrt{n}(T - I) = V$. Then U and V have independent limiting normal
distributions with means 0. The functionally independent elements of U and V are
statistically independent in the limiting distribution. The variances are
$\mathcal{E}u_{ii}^2 = 2(n/m)\lambda_i^2 \to 2\lambda_i^2/\eta$; $\mathcal{E}u_{ij}^2 = (n/m)\lambda_i\lambda_j \to \lambda_i\lambda_j/\eta$, $i \ne j$;
$\mathcal{E}v_{ii}^2 = 2$; $\mathcal{E}v_{ij}^2 = 1$, $i \ne j$.

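From the definition of L and X we have $SX = TXL$, $X'TX = I$, and
$X'SX = L$. If we let $X^{-1} = G$, we obtain

(12) $S = G'LG, \qquad T = G'G.$

We require $g_{ii} > 0$, $i = 1,\dots,p$. Since $S \to_p \Lambda$ and $T \to_p I$, we have $L \to_p \Lambda$ and
$G \to_p I$. Let $\sqrt{n}(G - I) = H$. Then we write (12) as

(13) $\Lambda + \dfrac{1}{\sqrt n}U = \left(I + \dfrac{1}{\sqrt n}H'\right)\left(\Lambda + \dfrac{1}{\sqrt n}D\right)\left(I + \dfrac{1}{\sqrt n}H\right),$
(14) $I + \dfrac{1}{\sqrt n}V = \left(I + \dfrac{1}{\sqrt n}H'\right)\left(I + \dfrac{1}{\sqrt n}H\right).$

These can be rewritten

(15) $U = D + \Lambda H + H'\Lambda + \dfrac{1}{\sqrt n}(DH + H'D + H'\Lambda H) + \dfrac{1}{n}H'DH,$
(16) $V = H + H' + \dfrac{1}{\sqrt n}H'H.$

If we neglect the terms of order $1/\sqrt n$ and $1/n$ (as in Section 13.5), we
can write

(17) $U = D + \Lambda H + H'\Lambda,$
(18) $V = H + H',$
(19) $U - V\Lambda = D + \Lambda H - H\Lambda.$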




The diagonal elements of (18) and the components of (19) are

(20) $v_{ii} = 2h_{ii},$
(21) $u_{ii} - \lambda_iv_{ii} = d_i,$
(22) $u_{ij} - v_{ij}\lambda_j = (\lambda_i - \lambda_j)h_{ij}, \qquad i \ne j.$

The limiting distribution of H and D is normal with means 0. The pairs
$(h_{ij}, h_{ji})$ of off-diagonal elements of H are independent with variances

(23) $\mathcal{AV}(h_{ij}) = \dfrac{\lambda_j(\lambda_i + \eta\lambda_j)}{\eta(\lambda_i - \lambda_j)^2}$
and covariances
(24) $\mathcal{AC}(h_{ij}, h_{ji}) = -\dfrac{\lambda_i\lambda_j(1+\eta)}{\eta(\lambda_i - \lambda_j)^2}, \qquad i \ne j.$

The pairs $(d_i, h_{ii})$ of diagonal elements of D and H are independent with
variances (5),

(25) $\mathcal{AV}(h_{ii}) = \tfrac12,$
and covariance
(26) $\mathcal{AC}(d_i, h_{ii}) = -\lambda_i.$

The diagonal elements of D and H are independent of the off-diagonal
elements of H.
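As a hedged worked step (a sketch that restates the linearized relations $v_{ii} = 2h_{ii}$, $d_i = u_{ii} - \lambda_iv_{ii}$, and $(\lambda_i - \lambda_j)h_{ij} = u_{ij} - \lambda_jv_{ij}$, and uses the limiting variances $2\lambda_i^2/\eta$, $\lambda_i\lambda_j/\eta$, 2, and 1 of $u_{ii}$, $u_{ij}$, $v_{ii}$, and $v_{ij}$ quoted in the proof, with U and V independent), the asymptotic moments follow by direct substitution:

```latex
\begin{align*}
d_i &= u_{ii} - \lambda_i v_{ii}
  &&\Rightarrow\quad \mathcal{AV}(d_i) = \frac{2\lambda_i^2}{\eta} + 2\lambda_i^2
     = \frac{2\lambda_i^2(1+\eta)}{\eta},\\
h_{ii} &= \tfrac12 v_{ii}
  &&\Rightarrow\quad \mathcal{AV}(h_{ii}) = \tfrac14\cdot 2 = \tfrac12,\qquad
     \mathcal{AC}(d_i,h_{ii}) = -\tfrac12\lambda_i\cdot 2 = -\lambda_i,\\
h_{ij} &= \frac{u_{ij}-\lambda_j v_{ij}}{\lambda_i-\lambda_j}
  &&\Rightarrow\quad \mathcal{AV}(h_{ij})
     = \frac{\lambda_i\lambda_j/\eta+\lambda_j^2}{(\lambda_i-\lambda_j)^2}
     = \frac{\lambda_j(\lambda_i+\eta\lambda_j)}{\eta(\lambda_i-\lambda_j)^2},\\
h_{ji} &= \frac{u_{ij}-\lambda_i v_{ij}}{\lambda_j-\lambda_i}
  &&\Rightarrow\quad \mathcal{AC}(h_{ij},h_{ji})
     = -\frac{\lambda_i\lambda_j/\eta+\lambda_i\lambda_j}{(\lambda_i-\lambda_j)^2}
     = -\frac{\lambda_i\lambda_j(1+\eta)}{\eta(\lambda_i-\lambda_j)^2}.
\end{align*}
```

The last line uses the symmetry $u_{ji} = u_{ij}$ and $v_{ji} = v_{ij}$.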

That the limiting distribution of D and H is normal is justified by
Theorem 4.2.3. S and T are polynomials in L and G, and their derivatives
are polynomials and hence continuous. Since the equations (12) with auxiliary
conditions can be solved uniquely for L and G, the inverse function is also
continuously differentiable at $L = \Lambda$ and $G = I$. By Theorem 4.2.3,
$D = \sqrt{n}(L - \Lambda)$ and $H = \sqrt{n}(G - I)$ have a limiting normal distribution. In
turn, $X = G^{-1}$ is continuously differentiable at $G = I$, and $Z = \sqrt{n}(X - I)
= \sqrt{n}(G^{-1} - I)$ has the limiting distribution of $-H$. (Expand $\sqrt{n}\{[I +
(1/\sqrt n)H]^{-1} - I\}$.) Since $G \to_p I$, $X \to_p I$, and $x_{ii} > 0$, $i = 1,\dots,p$, with
probability approaching 1. Then $Z^* = \sqrt{n}(X^* - \Gamma)$ has the limiting distribution of
$\Gamma Z$. (Since $X \to_p I$, we have $X^* = \Gamma X \to_p \Gamma$ and $x_{ii}^* > 0$, $i = 1,\dots,p$, with
probability approaching 1.) The asymptotic variances and covariances (6) to
(8) are obtained from (23) to (26). ∎




Anderson (1989b) has derived the limiting distribution of the characteristic
roots and vectors of one sample covariance matrix in the metric of another
with population roots of arbitrary multiplicities.


13.6.2. One Root of Higher Multiplicity 


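In Section 13.6.1 it was assumed that $mS^*$ and $nT^*$ were distributed
independently according to $W(\Phi, m)$ and $W(\Sigma, n)$, respectively, and that the
roots of $|\Phi - \lambda\Sigma| = 0$ were distinct. In this section we assume that the k
larger roots are distinct and greater than the $p - k$ smaller roots, which are
assumed equal. Let the diagonal matrix $\Lambda$ of characteristic roots be $\Lambda =
\operatorname{diag}(\Lambda_1, \lambda^*I_{p-k})$, and let $\Gamma$ be a matrix satisfying

(27) $\Phi\Gamma = \Sigma\Gamma\Lambda, \qquad \Gamma'\Sigma\Gamma = I.$

Define S and T by (9) and diagonal L and G by (12). Then $S \to_p \Lambda$, $T \to_p I$,
and $L \to_p \Lambda$. Partition S, T, L, and G as

(28) $S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}, \quad T = \begin{pmatrix} T_{11} & T_{12} \\ T_{21} & T_{22} \end{pmatrix}, \quad L = \begin{pmatrix} L_1 & 0 \\ 0 & L_2 \end{pmatrix}, \quad G = \begin{pmatrix} G_{11} & G_{12} \\ G_{21} & G_{22} \end{pmatrix},$

where $S_{11}$, $T_{11}$, $L_1$, and $G_{11}$ are $k \times k$. Then $G_{11} \to_p I_k$, $G_{12} \to_p 0$, and $G_{21} \to_p 0$,
but $G_{22}$ does not have a probability limit. Instead $G_{22}'G_{22} \to_p I_{p-k}$. Let the
singular value decomposition of $G_{22}$ be $EJF$, where E and F are orthogonal
and J is diagonal. Let $C_2 = EF$.

The limiting distribution of $U = \sqrt{n}(S - \Lambda)$ and $V = \sqrt{n}(T - I)$ is normal
with the covariance structure given above (12) with $\lambda_{k+1} = \cdots = \lambda_p = \lambda^*$.
Define $D = \sqrt{n}(L - \Lambda)$, $H_{11} = \sqrt{n}(G_{11} - I)$, $H_{12} = \sqrt{n}\,G_{12}$, $H_{21} = \sqrt{n}\,G_{21}$, and
$H_{22} = \sqrt{n}(G_{22} - C_2) = \sqrt{n}\,E(J - I_{p-k})F$. Then (13) and (15) are replaced by

(29) $\begin{pmatrix} \Lambda_1 & 0 \\ 0 & \lambda^*I_{p-k} \end{pmatrix} + \dfrac{1}{\sqrt n}\begin{pmatrix} U_{11} & U_{12} \\ U_{21} & U_{22} \end{pmatrix}$
$= \begin{pmatrix} I_k + \frac{1}{\sqrt n}H_{11}' & \frac{1}{\sqrt n}H_{21}' \\ \frac{1}{\sqrt n}H_{12}' & C_2' + \frac{1}{\sqrt n}H_{22}' \end{pmatrix}\begin{pmatrix} \Lambda_1 + \frac{1}{\sqrt n}D_1 & 0 \\ 0 & \lambda^*I_{p-k} + \frac{1}{\sqrt n}D_2 \end{pmatrix}\begin{pmatrix} I_k + \frac{1}{\sqrt n}H_{11} & \frac{1}{\sqrt n}H_{12} \\ \frac{1}{\sqrt n}H_{21} & C_2 + \frac{1}{\sqrt n}H_{22} \end{pmatrix}$
$= \begin{pmatrix} \Lambda_1 & 0 \\ 0 & \lambda^*I_{p-k} \end{pmatrix} + \dfrac{1}{\sqrt n}\left[\begin{pmatrix} D_1 & 0 \\ 0 & C_2'D_2C_2 \end{pmatrix} + \begin{pmatrix} \Lambda_1H_{11} & \Lambda_1H_{12} \\ \lambda^*C_2'H_{21} & \lambda^*C_2'H_{22} \end{pmatrix} + \begin{pmatrix} H_{11}'\Lambda_1 & \lambda^*H_{21}'C_2 \\ H_{12}'\Lambda_1 & \lambda^*H_{22}'C_2 \end{pmatrix}\right] + \dfrac{1}{n}M,$

and (14) and (16) are replaced by

(30) $\begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix} + \dfrac{1}{\sqrt n}\begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix}$
$= \begin{pmatrix} I_k + \frac{1}{\sqrt n}H_{11}' & \frac{1}{\sqrt n}H_{21}' \\ \frac{1}{\sqrt n}H_{12}' & C_2' + \frac{1}{\sqrt n}H_{22}' \end{pmatrix}\begin{pmatrix} I_k + \frac{1}{\sqrt n}H_{11} & \frac{1}{\sqrt n}H_{12} \\ \frac{1}{\sqrt n}H_{21} & C_2 + \frac{1}{\sqrt n}H_{22} \end{pmatrix}$
$= \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix} + \dfrac{1}{\sqrt n}\left[\begin{pmatrix} H_{11} & H_{12} \\ C_2'H_{21} & C_2'H_{22} \end{pmatrix} + \begin{pmatrix} H_{11}' & H_{21}'C_2 \\ H_{12}' & H_{22}'C_2 \end{pmatrix}\right] + \dfrac{1}{n}N.$

If we neglect the terms of order $1/\sqrt n$ and $1/n$, instead of (19) we can write

(31) $\begin{pmatrix} U_{11} - V_{11}\Lambda_1 & U_{12} - \lambda^*V_{12} \\ U_{21} - V_{21}\Lambda_1 & U_{22} - \lambda^*V_{22} \end{pmatrix} = \begin{pmatrix} D_1 + \Lambda_1H_{11} - H_{11}\Lambda_1 & (\Lambda_1 - \lambda^*I)H_{12} \\ C_2'H_{21}(\lambda^*I - \Lambda_1) & C_2'D_2C_2 \end{pmatrix}.$

Then $v_{ii} = 2h_{ii}$, $i = 1,\dots,k$; $u_{ii} - \lambda_iv_{ii} = d_i$, $i = 1,\dots,k$; $u_{ij} - v_{ij}\lambda_j =
(\lambda_i - \lambda_j)h_{ij}$, $i \ne j$, $i,j = 1,\dots,k$; $U_{22} - \lambda^*V_{22} = C_2'D_2C_2$; $C_2(U_{21} - V_{21}\Lambda_1) =
H_{21}(\lambda^*I - \Lambda_1)$; and $U_{12} - \lambda^*V_{12} = (\Lambda_1 - \lambda^*I)H_{12}$. The limiting
distribution of $U_{22} - \lambda^*V_{22}$ is normal with mean 0; $\mathcal{E}(u_{ii} - \lambda^*v_{ii})^2 =
2\lambda^{*2}(1+\eta)/\eta$, $i = k+1,\dots,p$; and $\mathcal{E}(u_{ij} - \lambda^*v_{ij})^2 = \lambda^{*2}(1+\eta)/\eta$, $i \ne j$,
$i,j = k+1,\dots,p$. The limiting distribution of $D_2$ and $C_2$ is the distribution
of $D_2$ and $C_2$ defined by $U_{22} - \lambda^*V_{22} = C_2'D_2C_2$, where $(1/\lambda^*)(U_{22} - \lambda^*V_{22})$ has
the density of (25) of Section 13.3.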




13.7. ASYMPTOTIC DISTRIBUTION IN A REGRESSION MODEL 


13.7.1. Both Sets of Variates Stochastic 


The sample canonical correlations /,,...,/,, and vectors @),...,@,, and 
Vis Vs are defined in Section 12.3. The set ¥,,...,¥,, and Hywel, are 
defined by 

(1) Sy SSF =SyI?, — ¥'SyP=. 


The asymptotic distribution of these quantities was given by Anderson
(1999a) when $X = (X^{(1)\prime}, X^{(2)\prime})'$ has a normal distribution and also when $X^{(1)}$
is normally distributed with a linear function of nonstochastic $X^{(2)}$ as
expected value. We shall now find the asymptotic distribution when $X$ has a
normal distribution. The model in regression form is


(2)    $X^{(1)} = BX^{(2)} + Z,$


where $X^{(2)}$ and $Z$ are independently normally distributed with expected
values $\mathcal{E}X^{(2)} = 0$ and $\mathcal{E}Z = 0$ and covariances $\mathcal{E}X^{(2)}X^{(2)\prime} = \Sigma_{22}$, $\mathcal{E}ZZ' = \Sigma_{ZZ}$
($\mathcal{E}X^{(2)}Z' = 0$). Then $\mathcal{E}X = 0$, $\mathcal{E}X^{(1)}X^{(1)\prime} = \Sigma_{11} = \Sigma_{ZZ} + B\Sigma_{22}B'$, and
$\mathcal{E}X^{(1)}X^{(2)\prime} = B\Sigma_{22}$. Inference is based on a sample of $X$ of $n$ observations.
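
As an illustration of the regression form (2), the following sketch simulates the model and computes sample canonical correlations and vectors from the second-moment matrices. It is a minimal example under stated assumptions: the dimensions, the matrix `b_true`, the identity covariances, and the helper name `sample_canonical` are all hypothetical, and the generalized characteristic equation $S_{21}S_{11}^{-1}S_{12}\gamma = l^2 S_{22}\gamma$ with $\gamma'S_{22}\gamma = 1$ is solved by a Cholesky reduction rather than by any method prescribed in the text.

```python
import numpy as np


def sample_canonical(x1, x2):
    """Sample canonical correlations l and vectors gamma_hat for X^(2).

    Solves S21 S11^{-1} S12 g = l^2 S22 g with g' S22 g = 1, using second
    moments about zero (the simulated model has mean 0).
    """
    n = x1.shape[0]
    s11 = x1.T @ x1 / n
    s12 = x1.T @ x2 / n
    s22 = x2.T @ x2 / n
    # Write S22 = C C' and reduce to an ordinary symmetric eigenproblem.
    c = np.linalg.cholesky(s22)
    c_inv = np.linalg.inv(c)
    m = c_inv @ s12.T @ np.linalg.solve(s11, s12) @ c_inv.T
    vals, vecs = np.linalg.eigh((m + m.T) / 2)
    order = np.argsort(vals)[::-1]          # largest root first
    vals, vecs = vals[order], vecs[:, order]
    gamma_hat = c_inv.T @ vecs              # back-transform: g = C'^{-1} e
    l = np.sqrt(np.clip(vals, 0.0, 1.0))
    return l, gamma_hat, s22


# Hypothetical example: p1 = 2, p2 = 3, n = 2000 observations from model (2).
rng = np.random.default_rng(0)
n, p1, p2 = 2000, 2, 3
b_true = np.array([[0.8, 0.0, 0.3],
                   [0.0, 0.5, -0.2]])
x2 = rng.standard_normal((n, p2))
z = rng.standard_normal((n, p1))
x1 = x2 @ b_true.T + z

l, gamma_hat, s22 = sample_canonical(x1, x2)
```

Because $p_1 < p_2$ in this example, only $p_1$ of the computed roots can be nonzero; the remaining one is zero up to rounding.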

First we transform to canonical variables $U = A'X^{(1)}$, $V = \Gamma'X^{(2)}$, and
$W = A'Z$. Then (2) is transformed to


(3)    $U = \Theta V + W,$


where @=4’B([')"', @UU' = Xyy=t,, CW = Xyy=l,,, @U' = Xyy 
=(A,0)=A, &WW'=Xyy=1,,— A, and &VW’ =0, [See (33) to (37) 
and (45) of Section 12.2.] Let the sample covariance matrices be Syy = 
A'S,,A’, Spy = A'S, I, and Sy,=T'S,,.1. Let the sample vectors consti- 


A 


tule H=Y'T=P"'(4,,...,4%,, Then H satisfies 
(4) SyySpuSyy H = Syp HA”, H'SyyH=1,, 
where A*:= diag(A,,...; NOs rag Os if p,; <p, there are p, — p, O’sin At. 


We have Syy 91, Syy O1,,, Suy & A. Then A = diag(Ay,...,A,,) A, 
Let 


H,, Hy 
a H(t) =| a 




where $H_{11}$ is $p_1 \times p_1$ and $H_{22}$ is $(p_2 - p_1) \times (p_2 - p_1)$. The first $p_1$ columns
of (4) are


(6) Syy Sz Suv, = Spy HA; 


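the last $p_2 - p_1$ columns of (4) are $S_{UV}H_2 = 0$. Then $H_{11} \xrightarrow{p} I_{p_1}$, $H_{12} \xrightarrow{p} 0$, and
$H_{21} \xrightarrow{p} 0$, but the probability limit of (4) only implies $H_{22}'H_{22} \xrightarrow{p} I_{p_2-p_1}$. Let
the singular value decomposition of $H_{22}$ be $H_{22} = EJF$.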
aoe Sy = Vn(Syy~I,,), Shy =Va(Sy,—1,,), Shy = Vn (Syy— A), 
= Vn CH, ~ I(,,))» and Pa Watae A), 0], where I,,.)=(/,,0)'. Then 
ea of (6) yields 


=| 
oe vo 1 Sos 1 
(7) A + FS | 1, + Stu | Re 2Sty (hyp + ut 
_ 1 Me hha, Aas ; 


From (7) we obtain 


(8) Sty AL, AL, + A’SiyL,,) +A’AH* 


Pi) 
= Styl) + HEA + 26, A A* +0,(1) 

From H;S,,H, =I, we derive 

(9) Hi, + Hi! = —Spy +0,(1). 

In terms of partitioned matrices (8) is 


Shy A ASA + ASH) — Shy A? 
(10) be VV 


Sti A — St7A2 


2A A* + H* A? — A? H* 
| r - +0,(1). 


~ | HRA? 


The lower submatrix equation [( p, — p,) X p,] of (10) is 


(11) HE} A=S}i} - SEPA + 0,(1) = Shy +0,(1) = (SH )' +0,(1). 




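A diagonal element of the upper submatrix equation of (10) is

(12)    $\lambda_i^* = \dfrac{1}{\sqrt{n}} \sum_{\alpha=1}^{n} \left[(u_{i\alpha}v_{i\alpha} - \lambda_i) - \tfrac{1}{2}\lambda_i(u_{i\alpha}^2 - 1) - \tfrac{1}{2}\lambda_i(v_{i\alpha}^2 - 1)\right] + o_p(1).$


The right-hand side of (12) is the expansion of the sample correlation
coefficient of $u_{i\alpha}$ and $v_{i\alpha}$. See Section 4.2.3. The limiting distribution of $\lambda_i^*$ is
$N[0, (1 - \lambda_i^2)^2]$.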
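
The limiting variance can be checked numerically from the summand of the expansion. The sketch below is only a Monte Carlo illustration under assumptions not in the text: the value `rho = 0.6` stands in for $\lambda_i$, the sample size and seed are arbitrary, and $(u, v)$ is drawn as a standard bivariate normal pair with correlation `rho`. The sample variance of the summand should be close to $(1 - \lambda_i^2)^2$.

```python
import numpy as np


def expansion_terms(u, v, rho):
    """Summand of the first-order expansion of the sample correlation."""
    return (u * v - rho) - 0.5 * rho * (u**2 - 1) - 0.5 * rho * (v**2 - 1)


rng = np.random.default_rng(1)
rho = 0.6            # hypothetical canonical correlation lambda_i
n = 400_000          # number of simulated pairs

# Standard bivariate normal pair with correlation rho.
u = rng.standard_normal(n)
v = rho * u + np.sqrt(1 - rho**2) * rng.standard_normal(n)

terms = expansion_terms(u, v, rho)
empirical_var = terms.var()
theoretical_var = (1 - rho**2) ** 2
```

The agreement follows from the normal moments $\operatorname{Var}(uv) = 1 + \rho^2$, $\operatorname{Var}(u^2) = 2$, $\operatorname{Cov}(uv, u^2) = 2\rho$, and $\operatorname{Cov}(u^2, v^2) = 2\rho^2$.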

The (7, jth component of H* in (10) is 


(13) 


n 
(a — a7 )h* rae (Aaya + AM Ya ~ MAMia Hye ~ AZ YqYa) + Op 1). 


ey ae fo ta“ja 


bey EJ= 1... py. 


The asymptotic covariance of (A) — a? )h¥, and (A? — Ap )h*, is 


(1 - (4? or ie 2A7d?} (1- )(1 - V(X +7) 


WO [a-ayaayare) (= a)(reag— aan) f 


The pair $(h_{ij}^*, h_{ji}^*)$ is uncorrelated with other pairs.

Suppose P) =P). Then [* = Real =TA*. Let P=(y,,....9,,) T= 
(Fis. ¥,). Then $7 = D2 y,hy), where hi, i + j, is ene from (13) and 
h*. from (6), We obtain 


5 t 
(15) n€(4;-¥))(4%) — ¥)) = 37,4) + (1-47) L 


ia Oba 
, (LAAT af)? +7 
(16) mOSRoaas! a Dips jel 
] t 


Anderson (1999a), has also given the asymptotic covariances of & ; and of ¥, 
and &,. Note that hf depends linearly on (u,,,u,,) and that the pairs 
(Uj45U;q) and (uj;,,U;,), i#j, are uncorrelated. The covariances (14) do not 
depend on (U,V) being normal. 

Now suppose that the rank of I',, is k<p,. Define H, = (Hj,.H3,)' as 
the first k columns of H satisfying (4), and define A, = diag(A,,..., A,). 
Then H, satisfies (6), and H} satisfies (8), (9), (10), and (11). Then a* is 
given by (12) for i= 1,...,4. 




The last p, ~&k columns of (4) are 


x12 _ AZ * 
(17) A Sty C, = AY AY +0,(1). 
0 0 
Hence 
(18) AHS = —ShyC, +0,(1) = -SELC, +0,(1). 


13.7.2. One Set of Variates Stochastic and the Other Set Nonstochastic


Now consider the case that X@ in (2) is nonstochastic, where ¢Z, = 0 and 
€Z,2',= %77. We observe X=xX,,...,X,. We assume 


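(19)    $S_{22} = \dfrac{1}{n}\sum_{\alpha=1}^{n} x_\alpha^{(2)}x_\alpha^{(2)\prime} \to \Sigma_{22},$

and $\Sigma_{22}$ is nonsingular. Then

(20)    $S_{11} = \dfrac{1}{n}\sum_{\alpha=1}^{n} x_\alpha^{(1)}x_\alpha^{(1)\prime} = BS_{22}B' + S_{Z2}B' + BS_{2Z} + S_{ZZ} \xrightarrow{p} B\Sigma_{22}B' + \Sigma_{ZZ},$


(21)    $S_{12} = \dfrac{1}{n}\sum_{\alpha=1}^{n} x_\alpha^{(1)}x_\alpha^{(2)\prime} = BS_{22} + S_{Z2} \xrightarrow{p} B\Sigma_{22}.$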


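Define $\lambda$, $\alpha$, and $\gamma$ by solutions to


(22)    $\begin{pmatrix} -\lambda(\Sigma_{ZZ} + BS_{22}B') & BS_{22} \\ S_{22}B' & -\lambda S_{22} \end{pmatrix} \begin{pmatrix} \alpha \\ \gamma \end{pmatrix} = 0,$

(23)    $\alpha'(\Sigma_{ZZ} + BS_{22}B')\alpha = 1, \qquad \gamma'S_{22}\gamma = 1.$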


We shall first assume p, =p, and A, > -- >A, > 0. Then (22) and (23) 
and a@,, > 0 define 


(24) diag(ar,....4),)=Ans  (@1..--,0),) =A 


7 PI 


ay (Yeertg) Hh 


Let U=A,X, v= x4, a= 1,...,.2, W=A,Z, O=A,BI) | = 
H=T,'T. Then H and A satisfy (4). Then S,, =J, 


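(25)    $S_{UV} = \Theta S_{VV} + S_{WV} = \Theta + S_{WV} \xrightarrow{p} \Theta,$


(26)    $S_{UU} = \Theta S_{VV}\Theta' + \Theta S_{VW} + S_{WV}\Theta' + S_{WW} \xrightarrow{p} I.$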


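Then (4) can be written

(27)    $(\Lambda + S_{VW})(\Lambda^2 + \Lambda S_{VW} + S_{WV}\Lambda + S_{WW})^{-1}(\Lambda + S_{WV})H = H\hat\Lambda^2.$


Note that $S_{VW} \xrightarrow{p} 0$, $S_{WW} \xrightarrow{p} I - \Lambda^2$, and hence $H \xrightarrow{p} I$, $\hat\Lambda \xrightarrow{p} \Lambda$.
Let $S_{VW}^* = \sqrt{n}\,S_{VW}$, $S_{WW}^* = \sqrt{n}\,[S_{WW} - (I - \Lambda^2)]$. Then (27) leads to


(28)    $(I - \Lambda^2)S_{VW}^*\Lambda + \Lambda S_{WV}^*(I - \Lambda^2) - \Lambda S_{WW}^*\Lambda + \Lambda^2H^*$
            $= H^*\Lambda^2 + 2\Lambda\Lambda^* + o_p(1).$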


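A diagonal term of (28) gives


(29)    $\lambda_i^* = (1 - \lambda_i^2)\dfrac{1}{\sqrt{n}}\sum_{\alpha=1}^{n} v_{i\alpha}w_{i\alpha} - \tfrac{1}{2}\lambda_i\dfrac{1}{\sqrt{n}}\sum_{\alpha=1}^{n}\left[w_{i\alpha}^2 - (1 - \lambda_i^2)\right] + o_p(1).$

Since

(30)    $\mathcal{E}(v_{i\alpha}w_{i\alpha})^2 = v_{i\alpha}^2(1 - \lambda_i^2),$

(31)    $\mathcal{E}\left[w_{i\alpha}^2 - (1 - \lambda_i^2)\right]^2 = 2(1 - \lambda_i^2)^2$


under the assumption that $W$ is normally distributed, the limiting distribution
of $\sqrt{n}(\hat\lambda_i - \lambda_i)$ is $N[0, (1 - \lambda_i^2)^2(1 - \tfrac{1}{2}\lambda_i^2)]$. Note that this variance is smaller
than in the case of $X^{(2)}$ stochastic.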
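
To make the comparison concrete, the two limiting variances can be tabulated side by side. The function name `asymptotic_variance` and the chosen value of the canonical correlation are illustrative assumptions; the formulas are $(1 - \lambda^2)^2$ for the stochastic case and $(1 - \lambda^2)^2(1 - \tfrac{1}{2}\lambda^2)$ for the nonstochastic case.

```python
def asymptotic_variance(lam, x2_stochastic):
    """Limiting variance of sqrt(n) * (sample - population canonical correlation).

    lam            -- population canonical correlation, 0 <= lam < 1
    x2_stochastic  -- True if X^(2) is random (normal), False if fixed
    """
    base = (1.0 - lam**2) ** 2
    if x2_stochastic:
        return base
    # With X^(2) nonstochastic the variance shrinks by the factor 1 - lam^2 / 2.
    return base * (1.0 - 0.5 * lam**2)


lam = 0.5  # hypothetical value
var_stochastic = asymptotic_variance(lam, True)
var_fixed = asymptotic_variance(lam, False)
ratio = var_fixed / var_stochastic
```

The ratio $1 - \tfrac{1}{2}\lambda^2$ equals 1 at $\lambda = 0$ and decreases toward $\tfrac{1}{2}$ as $\lambda \to 1$, so the reduction is largest for strongly correlated canonical variates.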

From (28) we find 


(32) 
n 
(Aj ay A? ) ht, a = u (1 = AF) Ue Wie A, ay AW; ig ( 1 2 d?) PY NM, Wig Aj 
+0,(1) 
Then 
(33) (AP APY 8( 45)? > (1 — A) (1 ~ aP)(AP ~ AF aI). 


The equation $H'S_{VV}H = I$ implies $H'H = I$, leading to $H^* = -H^{*\prime} + o_p(1)$,
that is, $h_{ij}^* = -h_{ji}^* + o_p(1)$.

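Now suppose that the rank of $B$ is $k < p_1 = p_2$. Then $\Lambda = \operatorname{diag}(\Lambda_1, 0)$,
where $\Lambda_1 = \operatorname{diag}(\lambda_1, \ldots, \lambda_k)$. Let $\Gamma = (\Gamma_1, \Gamma_2)$, where $\Gamma_1$ has $k$ columns and
$\Gamma_2$ has $p_2 - k$ columns. Define the partition (5) to be made into $k$ and $p_2 - k$
rows and columns. The probability limit of (4) implies $H_{11} \xrightarrow{p} I_k$, $H_{12} \xrightarrow{p} 0$,
$H_{21} \xrightarrow{p} 0$, and $H_{22}'H_{22} \xrightarrow{p} I$. Let the singular value decomposition of $H_{22}$ be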




$EJF$, where $J$ is a diagonal matrix of order $p_2 - k$ and $E$ and $F$ are
orthogonal matrices of order $p_2 - k$. Define $C_2 = EF$. The expansion of (4)
in terms of Sty=vn(Syy~A), Stp=vn(St,—A), Sty =vnlSyy- 
(- A), Hi =vn( Hy, -D, HK = Vn H,, Hi, =VvnH,,, and Hs = 
Vn (Hy — C,)= Vn EJ -1)F yields 


A, SHY (1-3) + (1- AA) Shy A, -AVSEYA, A, SHEC, 


34 
oD StZA, 0 


2A,A*+HAMI- ACHR -AU RS 


4 +0, 1 
Hy, AY 0) 


The $i$th diagonal term of (34) is (29) for $i = 1, \ldots, k$. The $i,j$th element of the
upper left-hand submatrix is (32) for $i \ne j$ and $i, j = 1, \ldots, k$. Two other
submatrix equations of (34) are

(35) A, Hi = —SEYC, +0,(1), 

(36) H3,A\ = Sty +o,(1). 

The equation I= H’S,, H = H'H yields 


Hi, + At Hs +HiC, 


37 J 7 
ie. CH +HE C,H + HRC, 


=0+0,(1). 
The off-diagonal submatrices of (37) agree with (35) and (36). 


13.7.3. Reduced Rank Regression Estimator 


When the rank of $B$ is specified to be $k$ ($< p_1$), the maximum likelihood
estimator of $B$ is


(38)    $\hat B_k = S_{12}\hat\Gamma_1\hat\Gamma_1'.$


See Section 12.7. In terms of (3) the reduced-rank regression estimator of $\Theta$
is


(39)    $\hat\Theta_k = S_{UV}H_1H_1'.$
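
A numerical sketch of the reduced-rank estimator may help fix ideas. It assumes the form $\hat B_k = S_{12}\hat\Gamma_1\hat\Gamma_1'$, where the columns of $\hat\Gamma_1$ are the first $k$ solutions of $S_{21}S_{11}^{-1}S_{12}\gamma = l^2S_{22}\gamma$ normalized by $\gamma'S_{22}\gamma = 1$. The function name `reduced_rank_estimator`, the simulated rank-one matrix `b_true`, and the sample size are hypothetical choices for illustration only.

```python
import numpy as np


def reduced_rank_estimator(x1, x2, k):
    """Rank-k estimator S12 G1 G1' built from the first k canonical vectors."""
    n = x1.shape[0]
    s11 = x1.T @ x1 / n
    s12 = x1.T @ x2 / n
    s22 = x2.T @ x2 / n
    # Solve the generalized eigenproblem via the Cholesky factor of S22.
    c = np.linalg.cholesky(s22)
    c_inv = np.linalg.inv(c)
    m = c_inv @ s12.T @ np.linalg.solve(s11, s12) @ c_inv.T
    vals, vecs = np.linalg.eigh((m + m.T) / 2)
    order = np.argsort(vals)[::-1]
    gamma1 = c_inv.T @ vecs[:, order[:k]]   # satisfies gamma1' S22 gamma1 = I_k
    return s12 @ gamma1 @ gamma1.T


# Hypothetical data: p1 = p2 = 3 and a true coefficient matrix of rank 1.
rng = np.random.default_rng(2)
n = 5000
b_true = np.outer([1.0, -0.5, 0.8], [0.6, 0.3, -0.4])
x2 = rng.standard_normal((n, 3))
x1 = x2 @ b_true.T + rng.standard_normal((n, 3))

b_rank1 = reduced_rank_estimator(x1, x2, 1)
b_full = reduced_rank_estimator(x1, x2, 3)

# Unrestricted least squares estimator S12 S22^{-1} for comparison.
s12 = x1.T @ x2 / n
s22 = x2.T @ x2 / n
b_ols = s12 @ np.linalg.inv(s22)
```

When $k$ equals the full number of columns, $\hat\Gamma\hat\Gamma' = S_{22}^{-1}$ and the estimator reduces to the unrestricted one, which is what the comparison of `b_full` with `b_ols` illustrates.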
roe X® is oe and © = diag a We define 


= ¥n(@, - H¥ =Va(H, -Iy) Shy - va (Suy - and S*, 
ae eee HiS,,H,=I we find Y* + He’ = een 




From (39) and (9) we obtain 


ell * * *T 
R by +A Hi + A A | Hy 
(40) @* -| a2! ! +0,(1) 
Vv 
sy) 833 
-| #2) +o0,(1). 
WV 0 


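We can compare $\hat\Theta_k$ with the maximum likelihood estimator unrestricted by
a rank condition, $\hat\Theta = S_{UV}S_{VV}^{-1}$. Then


(41)    $\hat\Theta^* = \sqrt{n}(\hat\Theta - \Theta) = \sqrt{n}(S_{UV} - \Theta S_{VV})S_{VV}^{-1}$

            $= S_{WV}^* + o_p(1) = \begin{pmatrix} S_{WV}^{*11} & S_{WV}^{*12} \\ S_{WV}^{*21} & S_{WV}^{*22} \end{pmatrix} + o_p(1),$


since $S_{VV} \xrightarrow{p} I$. The effect of the rank restriction is to replace the lower
right-hand submatrix of $S_{WV}^*$ by 0 (the parameter value).

Since $S_{WV}^* = (1/\sqrt{n})\sum_{\alpha=1}^{n} W_\alpha V_\alpha'$, we have $\operatorname{vec} S_{WV}^* = (1/\sqrt{n})\sum_{\alpha=1}^{n}(V_\alpha \otimes W_\alpha)$.
Because $V_\alpha$ and $W_\alpha$ are independent,


(42)    $\mathcal{E}\operatorname{vec} S_{WV}^*(\operatorname{vec} S_{WV}^*)' = \mathcal{E}VV' \otimes \mathcal{E}WW' = I \otimes (I - \Lambda^2) = \operatorname{diag}(I - \Lambda^2, \ldots, I - \Lambda^2),$

where $\Lambda = \operatorname{diag}(\Lambda_1, 0)$ and $I - \Lambda^2 = \operatorname{diag}(I - \Lambda_1^2, I)$. On the other hand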
me a 
—= WV foe (vOlio 1 
dn 2 | a | 0 a (1) 


VO ® W. 


es +0,(1), 
0 


(43) vec @* = vec 


where V, = (V{0" V2") and W, = (WO), Wi)", Then 
(44) 
1, ® (1, - A’) 0 
& vec @* (vec @F \’ > -A2 
i (vec OF ) ‘ 1,[" ce 4 


=diag(I, — A®,....1,,—A?, 4, — A4.0.....4 — 44,0) 


where there are k blocks of f A? and p, —k blocks of diag, — A;,0) 


In the original coordinate system 
(45) vec( B, - B) = vec[(A')7'(@, - @)I"] 
= [r ® ( A’)”']vec(@ ~@) 
= ((Pi.Ta) ® B22( Ap Az)(P- A?) ‘|vec(@ - 8). 
From (44) and (45) we obtain 
(46) & vecn( By — B)|vec( B, ~ B)]' 
— |r) @Ez7A(I,— A?) | 
+ (P03 @E7A(h- Ai)" AE z2| 
=[Fi0} @ B27] + (P50; @ B24 (I-A) | A E27 
= Lyx 8 Ezz — (UV) @ zz, A2%zz). 


If we define Q= Zy,V, = 2zzA,A,U—- A’)7! and =P, then B = OI’. 
We have 


(47) 0.(0'S3} 0) = 37 - Bizz Ar Aa Ez2, 

(48) MWS y,M) MW =F0, = 2; -00. 

Thus (46) can be written 

(49) & vec BE (vec BY)’ > By @ Bz7— [Bek — (MZ yy) | 
@|E,,- 2(0'3;}0) '0'). 


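Theorem 13.7.1. Let $(X_\alpha^{(1)\prime}, X_\alpha^{(2)\prime})'$, $\alpha = 1, \ldots, n$, be observations on the
random vector $X$ with mean 0 and covariance matrix $\Sigma$. Let $B = \Sigma_{12}\Sigma_{22}^{-1}$. Let
the columns of $\hat\Gamma_1$ satisfy (1) and $\hat\gamma_{ii} > 0$. Suppose that $X^{(1)} - BX^{(2)} = Z$ is
independent of $X^{(2)}$. Then the limiting distribution of $\operatorname{vec} B_k^* = \sqrt{n}\operatorname{vec}(\hat B_k - B)$,
with $\hat B_k = S_{12}\hat\Gamma_1\hat\Gamma_1'$, is normal with mean 0 and covariance matrix (46) or (49).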


Note that $B = \Omega\Pi' = \Omega M'(\Pi M^{-1})'$ for arbitrary nonsingular $M$; however,
(47) and (48) are invariant with respect to the transformation $\Omega \to \Omega M'$
and $\Pi \to \Pi M^{-1}$. Thus (49) holds for any factorization $B = \Omega\Pi'$.




The limiting distribution of $B_k^*$ only depends on $\sqrt{n}\,S_{Z2}S_{22}^{-1} =
(A')^{-1}S_{WV}^*S_{VV}^{-1}\Gamma'$ and hence holds under the same conditions as the
asymptotic normality of the least squares estimator $\hat B$.

Now suppose that X@ =x, @=1,...,n, is nonstochastic and that (19) 
holds. The model is (2); in the transformed coordinates [U = A’,x"?, v, = 
Vx, W=A',Z, @= A’,B(T))7! =A] the model is (3). H,=S'N, satis- 
fies (34) and (37). Again (39) holds. Further, (42) and (43) hold with V, = v, 
nonstochastic. 


Corollary 13.7.1. Let $x_1^{(2)}, \ldots, x_n^{(2)}$ be a set of vectors such that (19) holds.
Let $x_\alpha^{(1)} = Bx_\alpha^{(2)} + z_\alpha$, $\alpha = 1, \ldots, n$, where $z_\alpha$ is an observation on a random
vector $Z$ with $\mathcal{E}Z = 0$ and $\mathcal{E}ZZ' = \Sigma_{ZZ}$. Suppose $B$ has rank $k$. Then the
limiting distribution of $\sqrt{n}\operatorname{vec}(\hat B_k - B)$ is normal with mean 0 and covariance
(46) or (49).


13.8. ELLIPTICALLY CONTOURED DISTRIBUTIONS 


13.8.1. Observations Elliptically Contoured 


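Let $x_1, \ldots, x_N$ be $N$ observations on a random vector $X$ with density

(1)    $|\Psi|^{-1/2}\, g[(x - \nu)'\Psi^{-1}(x - \nu)],$


where $\Psi$ is a positive definite matrix, $R^2 = (X - \nu)'\Psi^{-1}(X - \nu)$, and
$\mathcal{E}R^4 < \infty$. Define $\kappa = p\,\mathcal{E}R^4/[(\mathcal{E}R^2)^2(p + 2)] - 1$. Then $\mathcal{E}X = \nu = \mu$ and
$\mathcal{E}(X - \nu)(X - \nu)' = (\mathcal{E}R^2/p)\Psi = \Sigma$. Define $\bar x$ and $S$ as the sample mean
and covariance matrix. Define the orthogonal matrices $\beta$ and $B$ and the
diagonal matrices $\Lambda$ and $L$ by


(2)    $\Sigma = \beta\Lambda\beta', \qquad S = BLB',$

$\lambda_1 > \cdots > \lambda_p$, $l_1 > \cdots > l_p$, $\beta_{1i} \ge 0$, $b_{1i} \ge 0$, $i = 1, \ldots, p$. As in Section 13.5.1,
define $T = \beta'S\beta = YLY'$, where $Y = \beta'B$ is orthogonal and $y_{ii} > 0$. Then


$\mathcal{E}T = \beta'\Sigma\beta = \Lambda$.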
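The limiting covariances of $\sqrt{N}\operatorname{vec}(S - \Sigma)$ and $\sqrt{N}\operatorname{vec}(T - \Lambda)$ are


(3)    $\lim_{N\to\infty} N\mathcal{E}\operatorname{vec}(S - \Sigma)[\operatorname{vec}(S - \Sigma)]' = (\kappa + 1)(I_{p^2} + K_{pp})(\Sigma \otimes \Sigma) + \kappa\operatorname{vec}\Sigma\,(\operatorname{vec}\Sigma)',$


(4)    $\lim_{N\to\infty} N\mathcal{E}\operatorname{vec}(T - \Lambda)[\operatorname{vec}(T - \Lambda)]' = (\kappa + 1)(I_{p^2} + K_{pp})(\Lambda \otimes \Lambda) + \kappa\operatorname{vec}\Lambda\,(\operatorname{vec}\Lambda)'.$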




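In terms of components $\mathcal{E}t_{ij} = \lambda_i\delta_{ij}$ and


(5)    $\lim_{N\to\infty} N\mathcal{E}(t_{ij} - \lambda_i\delta_{ij})(t_{kl} - \lambda_k\delta_{kl}) = (\kappa + 1)(\lambda_i\lambda_j\delta_{ik}\delta_{jl} + \lambda_i\lambda_j\delta_{il}\delta_{jk}) + \kappa\lambda_i\lambda_k\delta_{ij}\delta_{kl}.$


Let $\sqrt{N}(T - \Lambda) = U$, $\sqrt{N}(L - \Lambda) = D$, and $\sqrt{N}(Y - I_p) = W$. The set
$u_{11}, \ldots, u_{pp}$ are asymptotically independent of the set $(u_{12}, \ldots, u_{p-1,p})$; the
off-diagonal elements $u_{ij}$, $i < j$, are mutually independent with variances $(\kappa + 1)\lambda_i\lambda_j$;
the variance of $u_{ii} = d_i$ converges to $(3\kappa + 2)\lambda_i^2$; the covariance of $u_{ii} = d_i$
and $u_{kk} = d_k$, $i \ne k$, converges to $\kappa\lambda_i\lambda_k$. The limiting distribution of $w_{ij}$, $i \ne j$,
is the limiting distribution of $u_{ij}/(\lambda_j - \lambda_i)$. Thus the $w_{ij}$, $i < j$, are asymptotically
mutually independent with $\mathcal{E}w_{ij}^2 = (\kappa + 1)\lambda_i\lambda_j/(\lambda_i - \lambda_j)^2$. These variances
and covariances for the normal case hold for $\kappa = 0$.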
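
The component statements can be verified against the matrix form of the limiting covariance. The following sketch builds $(\kappa + 1)(I_{p^2} + K_{pp})(\Lambda \otimes \Lambda) + \kappa\operatorname{vec}\Lambda(\operatorname{vec}\Lambda)'$ and reads off individual entries. The helper names, the roots `lams`, and the value of `kappa` are illustrative assumptions; vec is taken column by column, so that $t_{ij}$ sits at position $i + jp$ (zero-based).

```python
import numpy as np


def commutation_matrix(p):
    """K_pp with K vec(A) = vec(A') for column-major vec."""
    k = np.zeros((p * p, p * p))
    for i in range(p):
        for j in range(p):
            k[i + j * p, j + i * p] = 1.0
    return k


def limiting_cov_vec_t(lams, kappa):
    """Limiting covariance of sqrt(N) vec(T - Lambda) for roots lams."""
    p = len(lams)
    lam = np.diag(lams)
    vec_lam = lam.reshape(-1, 1, order="F")
    eye = np.eye(p * p)
    first = (kappa + 1) * ((eye + commutation_matrix(p)) @ np.kron(lam, lam))
    return first + kappa * (vec_lam @ vec_lam.T)


lams = np.array([3.0, 2.0, 1.0])   # hypothetical population roots
kappa = 0.5                        # hypothetical kurtosis parameter
p = len(lams)
cov = limiting_cov_vec_t(lams, kappa)


def idx(i, j):
    """Position of t_ij in column-major vec(T), zero-based indices."""
    return i + j * p


var_u11 = cov[idx(0, 0), idx(0, 0)]      # (3*kappa + 2) * lambda_1^2
cov_u11_u22 = cov[idx(0, 0), idx(1, 1)]  # kappa * lambda_1 * lambda_2
var_u12 = cov[idx(0, 1), idx(0, 1)]      # (kappa + 1) * lambda_1 * lambda_2
cov_u11_u12 = cov[idx(0, 0), idx(0, 1)]  # diagonal and off-diagonal uncorrelated
```

Setting `kappa = 0` reproduces the normal-theory values $2\lambda_i^2$ and $\lambda_i\lambda_j$.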


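Theorem 13.8.1. Define diagonal $\Lambda$ and $L$ and orthogonal $\beta$ and $B$ by (2),
$\lambda_1 > \cdots > \lambda_p$, $l_1 > \cdots > l_p$, $\beta_{1i} \ge 0$, $b_{1i} \ge 0$, $i = 1, \ldots, p$. Define $G = \sqrt{N}(B - \beta)$
and diagonal $D = \sqrt{N}(L - \Lambda)$. Then the limiting distribution of $G$ and $D$ is
normal with $G$ and $D$ independent. The variance of $d_i$ is $(2 + 3\kappa)\lambda_i^2$, and the
covariance of $d_i$ and $d_k$ is $\kappa\lambda_i\lambda_k$. The covariance of $g_j$ is


(6)    $\mathcal{C}(g_j) = (1 + \kappa)\sum_{k=1,\,k\ne j}^{p} \dfrac{\lambda_j\lambda_k}{(\lambda_j - \lambda_k)^2}\,\beta_k\beta_k'.$


The covariance matrix of $g_i$ and $g_j$ is


(7)    $\mathcal{C}(g_i, g_j) = -(1 + \kappa)\dfrac{\lambda_i\lambda_j}{(\lambda_i - \lambda_j)^2}\,\beta_j\beta_i', \qquad i \ne j.$


Proof. The proof is the same as for Theorem 13.5.1 except that (4) is used
instead of (4) with $\kappa = 0$. ∎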


In Section 11.7.3 we used the asymptotic distribution of the smallest $q$
sample roots when the smallest $q$ population roots are equal. Let $\Lambda =
\operatorname{diag}(\Lambda_1, \lambda^*I_q)$, where the diagonal elements of (diagonal) $\Lambda_1$ are different
and are greater than $\lambda^*$. As before, let $U = \sqrt{N}(T - \Lambda)$, and let $U_{22}$ be the
lower right-hand $q \times q$ submatrix of $U$. Let $D_2$ and $Y_{22}$ be the lower
right-hand $q \times q$ submatrices of $D$ and $Y$. It was shown in Section 13.5.2 that
$U_{22} = Y_{22}D_2Y_{22}' + o_p(1)$.




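The criterion for testing the null hypothesis $\lambda_{p-q+1} = \cdots = \lambda_p$ is


(8)    $\dfrac{\prod_{i=p-q+1}^{p} l_i}{\left(\sum_{i=p-q+1}^{p} l_i/q\right)^{q}}.$


In Section 11.7.3 it was shown that $-N$ times the logarithm of (8) has the
limiting distribution of


(9)    $\dfrac{1}{2\lambda^{*2}}\left[2\sum_{p-q+1\le i<j\le p} u_{ij}^2 + \sum_{i=p-q+1}^{p} u_{ii}^2 - \dfrac{1}{q}\left(\sum_{i=p-q+1}^{p} u_{ii}\right)^{2}\right].$

The term $\sum_{i<j} u_{ij}^2$ has the limiting distribution of $(1 + \kappa)\lambda^{*2}\chi^2_{q(q-1)/2}$. The
limiting distribution of $(u_{p-q+1,p-q+1}, \ldots, u_{pp})$ is normal with mean 0 and
covariance matrix $[2(1 + \kappa)I_q + \kappa\varepsilon\varepsilon']\lambda^{*2}$, where $\varepsilon = (1, \ldots, 1)'$. The limiting distribution of
$\sum u_{ii}^2 - (\sum u_{ii})^2/q$ is that of $2(1 + \kappa)\lambda^{*2}\chi^2_{q-1}$. Hence, the limiting distribution of
(9) is the distribution of $(1 + \kappa)\chi^2_{q(q+1)/2-1}$.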
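
The degrees-of-freedom count for the diagonal terms can be checked by linear algebra. For a normal vector with covariance $[2(1 + \kappa)I_q + \kappa\varepsilon\varepsilon']\lambda^{*2}$, the quadratic form $\sum u_{ii}^2 - (\sum u_{ii})^2/q$ is governed by the characteristic roots of $PCP$, where $P = I_q - \varepsilon\varepsilon'/q$. The sketch below, with hypothetical values of `q`, `kappa`, and `lam_star`, shows that these roots are $2(1 + \kappa)\lambda^{*2}$ with multiplicity $q - 1$ and a single 0.

```python
import numpy as np

q, kappa, lam_star = 4, 0.5, 2.0   # hypothetical values

eps = np.ones((q, 1))
# Limiting covariance of the q smallest diagonal elements of U.
cov = (2 * (1 + kappa) * np.eye(q) + kappa * (eps @ eps.T)) * lam_star**2
# Projection that removes the component along eps (subtracts the mean).
proj = np.eye(q) - (eps @ eps.T) / q

eigs = np.sort(np.linalg.eigvalsh(proj @ cov @ proj))
scale = 2 * (1 + kappa) * lam_star**2

# Degrees of freedom: off-diagonal terms plus centered diagonal terms.
df_total = q * (q - 1) // 2 + (q - 1)
```

The term $\kappa\varepsilon\varepsilon'$ is annihilated by the projection, which is why $\kappa$ enters the limiting distribution only through the factor $1 + \kappa$.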

We are also interested in the characteristic roots and vectors of one 


covariance matrix in the metric of another covariance matrix. 


Theorem 13.8.2. Let S* be the sample covariance matrix of a sample of size 
M from (1), and let T* be the sample covariance matrix of a sample of size N 
from (1) with W replaced by &. Let A be the diagonal matrix with Ay > >A, 
(> 0) as the diagonal elements, where 2,,..., A, ar? the roots of |W — AX| = 0. 
Let T=(y,,.--,,) be the matrix with y, the solution of (W~A,X)y = 0, 
y'Zy=1, and y,,20. Let X* =(27,...,43) and diagonal L* consist of the 
solutions to 


(10) (S* —IT* )x* =0, 
x*'T*x* = 1, andx} >0. AsM >, N-> co, M/N > 0, the limiting distribu- 


tion of Z* = YN(X* —T) and diagonal D* = VN(L— A) is normal with the 
following covanances: 


(11) 'V(d,) = (243) B24, 


; 1+ 
(12) .f@(d;,d,) = 




A ACA tM), 243K 


(13) Ye(z)=(Lte) Ue 2 Ve¥« Wis 

het (A, —A)) : 

Art 

2+3 
(14) VC (d,.z,) = is oY, 
; ACL +n ,.«K Snip 3 
(15) Vt tes) = SEL) J eres WV tT AVY), L#], 
nla, —A, 


(16) we (d,.2z,)=5A4,.- 

Proof. Transform S* and T* to S=T’S*VT and T=L’T*T, ® and & 
to A=T’'@T and 1=T’ST, and X* to X=[7'X*=G"!. Let D= 
VN(L-A), H=V¥N(G-1). U=VN(S—A), and V=VN(T-J). The 
matrices U and V and D and H have limiting normal distributions; they are 
related by (20), (21), and (22) of Section 13.6. From there and the covariances 
of the limiting distributions we derive (11) to (16). a 


13.8.2. Elliptically Contoured Matrix Distributions 


Let ¥ (p XN) have the density g(tr YY’). Then A=YY’ has the density 
(Lemma 13.3.1) 


a | Al YN+ pol ) 
(17) CEN) 


g(r 4). 


Let A = BLB’, where L is diagonal with diagonal elements 1, > --* >/, and 
B is orthogonal with b,, > 0. Since g(tr A) = g(ZP_,1,), the density of /),..., 4, 
is (Theorem 13.3.4) 


mete aad 
(18) ee A ay, 
Vi(ap) 


and the matrix B is independently distributed according to the conditional 
Haar invariant distribution. 


Suppose Y* (p Xm) and Z* (p Xx) have the density 
Wor? el (Ye W'y* +Z*' WZ*)| (m,n>p). 
Let C be a matrix such that CWC’ =7. Then Y=CY* and Z=CZ* have 


the density gltr(yY’ + ZZ')|. Let A* =¥*Y*', BY =Z*Z*', A=YY', and 
B=2Z'. The roots of |A* —/B*| =0 are the roots of (A —/B| =0. Let the 




roots of |A—f(A+B)| =0 be f,>--- >f,, and let F=diag(f,,..., f,). 
Define E (p X p) by A+ B=E'E, and A=E'FE, and e, >0,i=1,..., p. 


Theorem 13.8.3. The matrices E and F are independent. The densit: of F is 


aT [i(m+n)] i ae Geet) 
CO) TBD Gmytg Cay A PO 


the density of E is 


2°T, (dp )arintm—P) 


20 ae 
( ) vim +n 2a T [F(m+n)| 


|E’E| 0" +"-P) 9 (tr E'E). 


In the development in Section 13.2 the observations $Y$, $Z$ have the density

(21)    $(2\pi)^{-\frac{1}{2}p(n+m)}e^{-\frac{1}{2}\operatorname{tr}(Y'Y + Z'Z)} = (2\pi)^{-\frac{1}{2}p(n+m)}e^{-\frac{1}{2}\operatorname{tr}(A + B)},$


and in Section 13.7 $g[\operatorname{tr}(Y'Y + Z'Z)] = g[\operatorname{tr}(A + B)]$. The distribution of the
roots does not depend on the form of $g(\cdot)$; the distribution of $E$ depends
only on $E'E = A + B$. The algebra in Section 13.2 carries over to this more
general case.


PROBLEMS 


13.1. (Sec. 13.2) Prove Theorem 13.2.1 for p=2 by calculating the Jacobian 
directly. 


13.2. (Sec. 13.2) Prove Theorem 13.3.2 for p=2 directly by representing the 
orthogonal matrix C in terms of the cosine and sine of an angle. 


13.3. (Sec. 13.2) Consider the distribution of the roots of $|A - lB| = 0$ when $A$ and
$B$ are of order two and are distributed according to $W(\Sigma, m)$ and $W(\Sigma, n)$,
respectively.


(a) Find the distribution of the larger root. 
(b) Find the distribution of the smaller root. 
(c) Find the distribution of the sum of the roots. 


13.4. (Sec. 13.2) Prove that the Jacobian $|\partial(G, A)/\partial(E, F)|$ is $\prod(f_i - f_j)$ times a
function of $E$ by showing that the Jacobian vanishes for $f_i = f_j$ and that its
degree in $f_i$ is the same as that of $\prod(f_i - f_j)$.


13.5. (Sec. 13.3) Give the Haar invariant distribution explicitly for the 2 x 2 orthog- 
onal matrix represented in terms of the cosine and sine of an angle. 


13.6. (Sec. 13.3) Let $A$ and $B$ be distributed according to $W(\Sigma, m)$ and $W(\Sigma, n)$,
respectively. Let $l_1 > \cdots > l_p$ be the roots of $|A - lB| = 0$ and $m_1 > \cdots > m_p$
be the roots of $|A - m\Sigma| = 0$. Find the distribution of the $m$'s from that of the
$l$'s by letting $n \to \infty$.


13.7. (Sec. 13.3) Prove Lemma 13.3.1 in as much detail as Theorem 13.3.1.


13.8. Let $A$ be distributed according to $W(\Sigma, n)$. In case of $p = 2$ find the distribution
of the characteristic roots of $A$. [Hint: Transform so that $\Sigma$ goes into a
diagonal matrix.]


13.9. From the result in Problem 13.6 find the distribution of the sphericity criterion
(when the null hypothesis is not true).


13.10. (Sec. 13.3) Show that $X$ ($p \times n$) has the density $f_X(X'X)$ if and only if $T$ has
the density


$\dfrac{2^p\pi^{np/2}}{\Gamma_p(\frac{1}{2}n)}\prod_{i=1}^{p} t_{ii}^{\,n-i}\, f_X(TT'),$


where $T$ is the lower triangular matrix with positive diagonal elements such
that $TT' = X'X$. [Srivastava and Khatri (1979)]. [Hint: Compare Lemma 13.3.1
with Corollary 7.2.1.]


13.11. (Sec. 13.5.2) In the case that the covariance matrix is (12) find the limiting
distribution of D,, W,,, Wrz, and W.. 


13.12. (Sec. 13.3) Prove (6) of Section 12.4.


CHAPTER 14 


Factor Analysis 


14.1. INTRODUCTION 


Factor analysis is based on a model in which the observed vector is partitioned
into an unobserved systematic part and an unobserved error part. The
components of the error vector are considered as uncorrelated or independent,
while the systematic part is taken as a linear combination of a relatively
small number of unobserved factor variables. The analysis separates the
effects of the factors, which are of basic interest, from the errors. From
another point of view the analysis gives a description or explanation of the
interdependence of a set of variables in terms of the factors without regard to
the observed variability. This approach is to be compared with principal
component analysis, which describes or “explains” the variability observed.
Factor analysis was developed originally for the analysis of scores on mental
tests; however, the methods are useful in a much wider range of situations,
for example, analyzing sets of tests of attitudes, sets of physical measurements,
and sets of economic quantities. When a battery of tests is given to a
group of individuals, it is observed that the score of an individual on a given
test is more related to his scores on other tests than to the scores of other
individuals on the other tests; that is, usually the scores for any particular
individual are interrelated to some degree. This interrelation is “explained”
by considering a test score of an individual as made up of a part which is
peculiar to this particular test (called error) and a part which is a function of
more fundamental quantities called scores of primary abilities or factor scores.
Since they enter several test scores, it is their effect that connects the various






test scores. Roughly, the idea is that a person who is more intelligent in some 
respects will do better on many tests than someone who is less intelligent. 

The model for factor analysis is defined and discussed in Section 14.2.
Maximum likelihood estimators of the parameters are derived in the case
that the factor scores and errors are normally distributed, and a test that the
model fits is developed. The large-sample distribution theory is given for the
estimators and test criterion (Section 14.3). Maximum likelihood estimators
for fixed factors do not exist, but alternative estimation procedures are 
suggested (Section 14.4). Some aspects of interpretation are treated in 
Section 14.5. The maximum likelihood estimators are derived when the 
factors are normal and identification is effected by specified zero loadings. 
Finally the estimation of factor scores is considered. Anderson (1984a) 
discusses the relationship of factor analysis to principal components and 
linear functional and structural relationships. 


14.2. THE MODEL 


14.2.1. Definition of the Model 


Let the observable vector $X$ be written as

(1)    $X = \Lambda f + U + \mu,$


where $X$, $U$, and $\mu$ are column vectors of $p$ components, $f$ is a column
vector of $m$ ($< p$) components, and $\Lambda$ is a $p \times m$ matrix. We assume that $U$
is distributed independently of $f$ and with mean $\mathcal{E}U = 0$ and covariance
matrix $\mathcal{E}UU' = \Psi$, which is diagonal. The vector $f$ will be treated alternatively
as a random vector and as a vector of parameters that varies from
observation to observation.

In terms of mental tests each component of $X$ is a score on a test or
battery of tests. The corresponding component of $\mu$ is the average score of
this test in the population. The components of $f$ are the scores of the mental
factors; linear combinations of these enter into the test scores. The coefficients
of these linear combinations are the elements of $\Lambda$, and these are
called factor loadings. Sometimes the elements of $f$ are called common
factors because they are common to several different tests; in the first
presentation of this kind of model [Spearman (1904)] $f$ consisted of one
component and was termed the general factor. A component of $U$ is the part
of the test score not “explained” by the common factors. This is considered as
made up of the error of measurement in the test plus a specific factor, having
to do only with this particular test. Since in our model (with one set of
observations on each individual) we cannot distinguish between these two



components of the coordinate of U, we shall simply term the element of U 
the error of measurement. 

The specification of a given component of X is similar to that in regres- 
sion theory (or analysis of variance) in that it is a linear combination of other 
variables. Here, however, f, which plays the role of the independent variable, 
is not observed. 

We can distinguish between two kinds of models. In one we consider the
vector $f$ to be a random vector, and in the other we consider $f$ to be a vector
of nonrandom quantities that varies from one individual to another. In the
second case, it is more accurate to write $X_\alpha = \Lambda f_\alpha + U + \mu$. The nonrandom
factor score vector may seem a better description of the systematic part, but
it poses problems of inference because the likelihood function may not have
a maximum. In principle, the model with random factors is appropriate when
different samples consist of different individuals; the nonrandom factor
model is suitable when the specific individuals involved and not just the
structure are of interest.

When $f$ is taken as random, we assume $\mathcal{E}f = 0$. (Otherwise, $\mathcal{E}X =
\Lambda\mathcal{E}f + \mu$, and $\mu$ can be redefined to absorb $\Lambda\mathcal{E}f$.) Let $\mathcal{E}ff' = \Phi$. Our
analysis will be made in terms of first and second moments. Usually, we shall
consider $f$ and $U$ to have normal distributions. If $f$ is not random, then
$f = f_\alpha$ for the $\alpha$th individual. Then we shall assume usually $(1/N)\sum_{\alpha=1}^{N} f_\alpha = 0$
and $(1/N)\sum_{\alpha=1}^{N} f_\alpha f_\alpha' = \Phi$.

There is a fundamental indeterminacy in this model. Let $f = Cf^*$ ($f^* =
C^{-1}f$) and $\Lambda^* = \Lambda C$, where $C$ is a nonsingular $m \times m$ matrix. Then (1) can
be written as


(2)    $X = \Lambda^*f^* + U + \mu.$


When $f$ is random, $\mathcal{E}f^*f^{*\prime} = C^{-1}\Phi(C^{-1})' = \Phi^*$; when $f$ is nonrandom,
$(1/N)\sum_{\alpha=1}^{N} f_\alpha^*f_\alpha^{*\prime} = \Phi^*$. The model with $\Lambda$ and $f$ is equivalent to the model
with $\Lambda^*$ and $f^*$; that is, by observing $X$ we cannot distinguish between these
two models.

Some of the indeterminacy in the model can be eliminated by requiring
that $\mathcal{E}ff' = I$ if $f$ is random, or $\sum_{\alpha=1}^{N} f_\alpha f_\alpha' = NI$ if $f$ is not random. In this
case the factors are said to be orthogonal; if $\Phi$ is not diagonal, the factors
are said to be oblique. When we assume $\Phi = I$, then $\mathcal{E}f^*f^{*\prime} = C^{-1}(C^{-1})' = I$
($I = CC'$). The indeterminacy is equivalent to multiplication by an orthogonal
matrix; this is called the problem of rotation. Requiring that $\Phi$ be diagonal
means that the components of $f$ are independently distributed when $f$ is
assumed normal. This has an appeal to psychologists because one idea of
common mental factors is (by definition) that they are independent or
uncorrelated quantities.




A crucial assumption is that the components of U are uncorrelated. Our 
viewpoint is that the errors of observation and the specific factors are by 
definition uncorrelated. That is, the interrelationships of the test scores are 
caused by the common factors, and that is what we want to investigate. There 
is another point of view on factor analysis that is fundamentally quite 
different; that is, that the common factors are supposed to explain or account 
for as much of the variance of the test scores as possible. To follow this point 
of view, we should use a different model. 

A geometric picture helps the intuition. Consider a $p$-dimensional space.
The columns of $\Lambda$ can be considered as $m$ vectors in this space. They span
some $m$-dimensional subspace; in fact, they can be considered as coordinate
axes in the $m$-dimensional space, and $f$ can be considered as coordinates of
a point in that space referred to this particular axis system. This subspace is
called the factor space. Multiplying $\Lambda$ on the right by a matrix corresponds to
taking a new set of coordinate axes in the factor space.

If the factors are random, the covariance matrix of the observed $X$ is


(3)    $\Sigma = \mathcal{E}(X - \mu)(X - \mu)' = \mathcal{E}(\Lambda f + U)(\Lambda f + U)' = \Lambda\Phi\Lambda' + \Psi.$

If the factors are orthogonal ($\mathcal{E}ff' = I$), then (3) is

(4)    $\Sigma = \Lambda\Lambda' + \Psi.$


If $f$ and $U$ are normal, all the information about the structure comes from
(3) [or (4)] and $\mathcal{E}X = \mu$.
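
Using the covariance structure $\Lambda\Phi\Lambda' + \Psi$, the indeterminacy described earlier can be checked numerically: replacing $\Lambda$ by $\Lambda C$ and $\Phi$ by $C^{-1}\Phi(C^{-1})'$ leaves the covariance matrix unchanged. The sketch below is illustrative only; the dimensions, the loadings, $\Phi$, $\Psi$, and the nonsingular matrix `c` are arbitrary hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 5, 2

lam = rng.standard_normal((p, m))                 # factor loadings (p x m)
phi = np.array([[1.0, 0.3],
                [0.3, 2.0]])                      # covariance of the factors
psi = np.diag(rng.uniform(0.5, 1.5, size=p))      # diagonal error covariance

c = np.array([[2.0, 1.0],
              [0.5, 1.5]])                        # any nonsingular m x m matrix
c_inv = np.linalg.inv(c)

lam_star = lam @ c                                # transformed loadings
phi_star = c_inv @ phi @ c_inv.T                  # transformed factor covariance

sigma = lam @ phi @ lam.T + psi
sigma_star = lam_star @ phi_star @ lam_star.T + psi
```

Since `sigma` and `sigma_star` agree, observations on $X$ alone cannot distinguish the two parametrizations, which is the reason identification conditions are needed.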


14.2.2. Identification 


Given a covariance matrix $\Sigma$ and a number $m$ of factors, we can ask whether
there exist a triplet $\Lambda$, $\Phi$ positive definite, and $\Psi$ positive definite and
diagonal to satisfy (3); if so, is the triplet unique? Since any triplet can be
transformed into an equivalent structure $\Lambda C$, $C^{-1}\Phi C'^{-1}$, and $\Psi$, we can
put $m^2$ independent conditions on $\Lambda$ and $\Phi$ to rule out this indeterminacy.
The number of components in the observable $\Sigma$ and the number of conditions
(for uniqueness) is $\tfrac{1}{2}p(p + 1) + m^2$; the numbers of parameters in $\Lambda$,
$\Phi$, and $\Psi$ are $pm$, $\tfrac{1}{2}m(m + 1)$, and $p$, respectively. If the excess of observed
quantities and conditions over number of parameters, namely, $\tfrac{1}{2}[(p - m)^2
- p - m]$, is positive, we can expect a problem of existence but can anticipate
uniqueness if a set of parameters does exist. If the excess is negative, we can
expect existence but possibly not uniqueness; if the excess is 0, we can hope
for both existence and uniqueness (or at least a finite number of solutions).
The question of existence of a solution is whether there exists a diagonal




matrix $\Psi$ with nonnegative diagonal entries such that $\Sigma - \Psi$ is positive
semidefinite of rank $m$. Anderson and Rubin (1956) include most of the
known results on this problem.
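
The counting argument is easy to tabulate. The helper below, whose name `identification_excess` is a hypothetical choice, counts $\tfrac{1}{2}p(p + 1) + m^2$ observed quantities and conditions against $pm + \tfrac{1}{2}m(m + 1) + p$ parameters; the difference equals $\tfrac{1}{2}[(p - m)^2 - p - m]$.

```python
def identification_excess(p, m):
    """Observed quantities plus conditions minus number of parameters."""
    observed_and_conditions = p * (p + 1) // 2 + m * m
    parameters = p * m + m * (m + 1) // 2 + p
    return observed_and_conditions - parameters


# A few illustrative cases (p tests, m factors).
excess_5_2 = identification_excess(5, 2)   # positive: existence is in question
excess_3_1 = identification_excess(3, 1)   # zero: hope for existence and uniqueness
excess_4_2 = identification_excess(4, 2)   # negative: uniqueness is in question
```

For example, with five tests and two factors the excess is 1, so a solution need not exist for an arbitrary $\Sigma$, while with four tests and two factors the excess is $-1$.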

If a solution exists and is unique, the model is said to be identified. As
noted above, some $m^2$ conditions have to be put on $\Lambda$ and $\Phi$ to eliminate a
transformation $\Lambda^* = \Lambda C$ and $\Phi^* = C^{-1}\Phi C'^{-1}$. We have referred above to
the condition $\Phi = I$, which forces a transformation $C$ to be orthogonal.
[There are $\tfrac{1}{2}m(m + 1)$ component equations in $\Phi = I$.] For some purposes, it
is convenient to add the restrictions that


(5)    $\Gamma = \Lambda'\Psi^{-1}\Lambda$


is diagonal. If the diagonal elements of $\Gamma$ are ordered and different ($\gamma_{11} >
\gamma_{22} > \cdots > \gamma_{mm}$), $\Lambda$ is uniquely determined. Alternative conditions are that
the first $m$ rows of $\Lambda$ form a lower triangular matrix. A generalization of this
condition is to require that the first $m$ rows of $B\Lambda$ form a lower triangular
matrix, where $B$ is given in advance. (This condition is implied by the
so-called centroid method.)
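
As a sketch of how the diagonality condition can be imposed in the orthogonal case, one may rotate an arbitrary loading matrix by the orthogonal matrix of characteristic vectors of $\Lambda'\Psi^{-1}\Lambda$. The function name and the numerical inputs below are hypothetical; the rotation leaves $\Lambda\Lambda'$, and hence $\Sigma = \Lambda\Lambda' + \Psi$, unchanged.

```python
import numpy as np


def rotate_to_diagonal_gamma(lam, psi_diag):
    """Rotate lam (p x m) so that lam' Psi^{-1} lam is diagonal and ordered.

    psi_diag holds the diagonal of Psi. Returns the rotated loadings and
    the diagonal elements in decreasing order.
    """
    g = lam.T @ (lam / psi_diag[:, None])      # Lambda' Psi^{-1} Lambda
    vals, vecs = np.linalg.eigh((g + g.T) / 2)
    order = np.argsort(vals)[::-1]             # largest element first
    q = vecs[:, order]                         # orthogonal rotation
    return lam @ q, vals[order]


rng = np.random.default_rng(3)
lam = rng.standard_normal((6, 2))              # hypothetical loadings
psi_diag = rng.uniform(0.5, 1.5, size=6)       # hypothetical error variances

lam_rot, gammas = rotate_to_diagonal_gamma(lam, psi_diag)
g_rot = lam_rot.T @ (lam_rot / psi_diag[:, None])
```

The rotated loadings are still determined only up to the sign of each column, so a sign convention is needed in addition to the ordering of the diagonal elements.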


Simple Structure 

These are conditions proposed by Thurstone (1947, p. 335) for choosing a
matrix out of the class $\Lambda C$ that will have particular psychological meaning. If
$\lambda_{i\alpha} = 0$, then the $\alpha$th factor does not enter into the $i$th test. The general idea
of simple structure is that many tests should not depend on all the factors
when the factors have real psychological meaning. This suggests that, given a
$\Lambda$, one should consider all rotations, that is, all matrices $\Lambda C$ where $C$ is
orthogonal, and choose the one giving most 0 coefficients. This matrix can be
considered as giving the simplest structure and presumably the one with most
meaningful psychological interpretation. It should be remembered that the
psychologist can construct his or her tests so that they depend on the
assumed factors in different ways.

The positions of the 0's are not chosen in advance, but rotations $C$ are
tried until a $\Lambda$ is found satisfying these conditions. It is not clear that these
conditions effect identification. Reiersøl (1950) modified Thurstone's conditions
so that there is only one rotation that satisfies the conditions, thus
effecting identification.


Zero Elements in Specified Positions 

Here we consider a set of conditions that requires of the investigator more
a priori information. He or she must know that some particular tests do not
depend on some specific factors. In this case, the conditions are that $\lambda_{i\alpha} = 0$
for specified pairs $(i, \alpha)$; that is, that the $\alpha$th factor does not affect the $i$th



test score. Then we do not assume that &ff’ =f. These conditions are 
simmlar to some used in econometric models. The coefficients of the ath 
column are identified except for multiplication by a scale factor if (a) there 
are at least 1 — 1 zero elements in that column and if (b) the rank of A? is 
m~1. where A‘*) is the matrix composed of the rows containing the 
assigned 0's in the ath column with those assigned 0’s deleted (i.e., the ath 
column deleted), (See Problem 14.1.) The multiplication of a column by a 
scale constant can be eliminated by a normalization, such as ¢,,=1 or 
A,,=1 for some i for each a. If ¢,=1, a=1,...,m, then ® is a 
correlation matrix. 

It will be seen that there are $m$ normalizations and a minimum of
$m(m - 1)$ zero conditions. This is equal to the number of elements of $C$. If
there are more than $m - 1$ zero elements specified in one or more columns
of $\Lambda$, then there may be more conditions than are required to take out the
indeterminacy in $\Lambda C$; in this case the conditions may restrict $\Lambda\Phi\Lambda'$.

As an example, consider the model 


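$$X = \mu + \begin{pmatrix} 1 & 0 \\ \lambda_{21} & 0 \\ \lambda_{31} & \lambda_{32} \\ 0 & \lambda_{42} \\ 0 & 1 \end{pmatrix} \begin{pmatrix} v \\ a \end{pmatrix} + U \tag{6}$$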


for the scores on five tests, where $v$ and $a$ are measures of verbal and
arithmetic ability. The first two tests are specified to depend only on verbal
ability while the last two tests depend only on arithmetic ability. The
normalizations put verbal ability into the scale of the first test and arithmetic
ability into the scale of the fifth test.

Koopmans and Reiersøl (1950), Anderson and Rubin (1956), and Howe
(1955) suggested the use of preassigned 0's for identification and developed
maximum likelihood estimation under normality for this case. [See also
Lawley (1958).] Jöreskog (1969) called factor analysis under these identification
conditions confirmatory factor analysis; with arbitrary conditions or with
rotation to simple structure, it has been called exploratory factor analysis.




Other Conditions 

A convenient set of conditions is to require the upper square submatrix of $\Lambda$
to be the identity. This assumes that the upper square matrix without this
condition is nonsingular. In fact, if $\Lambda^* = (\Lambda_1^{*\prime}, \Lambda_2^{*\prime})'$ is an arbitrary $p \times m$
matrix with $\Lambda_1^*$ square and nonsingular, then $\Lambda = \Lambda^*\Lambda_1^{*-1} = (I_m, \Lambda_2')'$ satisfies
the condition. (This specification of the leading $m \times m$ submatrix of $\Lambda$
as $I_m$ is a convenient identification condition and does not imply any
substantive meaning.)


14.2.3. Units of Measurement 


We have considered factor analysis methods applied to covariance matrices. 
In many cases the unit of measurement of each component of X is arbitrary. 
For instance, in psychological tests the unit of scoring has no intrinsic 
meaning. 

Changing the units of measurement means multiplying each component of
$X$ by a constant; these constants are not necessarily equal. When a given test
score is multiplied by a constant, the factor loadings for the test are
multiplied by the same constant and the error variance is multiplied by the
square of the constant. Suppose $DX = X^*$, where $D$ is a diagonal matrix with
positive diagonal elements. Then (1) becomes

$$X^* = \Lambda^* f + U^* + \mu^*, \tag{7}$$

where $\mu^* = \mathcal{E}X^* = D\mu$, $\Lambda^* = D\Lambda$, and $U^* = DU$ has covariance matrix
$\Psi^* = D\Psi D$. Then

$$\mathcal{E}(X^* - \mu^*)(X^* - \mu^*)' = \Lambda^*\Phi\Lambda^{*\prime} + \Psi^* = \Sigma^*, \tag{8}$$

where $\Sigma^* = D\Sigma D$. Note that if the identification conditions are $\Phi = I$ and
$\Lambda'\Psi^{-1}\Lambda$ diagonal, then $\Lambda^*$ satisfies the latter condition, since
$\Lambda^{*\prime}\Psi^{*-1}\Lambda^* = \Lambda'\Psi^{-1}\Lambda$. If $\Lambda$ is identified
by specified 0's and the normalization is by $\phi_{\alpha\alpha} = 1$, $\alpha = 1, \ldots, m$ (i.e., $\Phi$ is a
correlation matrix), then $\Lambda^* = D\Lambda$ is similarly identified. (If the normalization
is $\lambda_{i\alpha} = 1$ for specified $i$ for each $\alpha$, each column of $D\Lambda$ has to be
renormalized.)

A particular diagonal matrix $D$ consists of the reciprocals of the observable
standard deviations, $d_{ii} = 1/\sqrt{\sigma_{ii}}$. Then $\Sigma^* = D\Sigma D$ is the correlation
matrix.

We shall see later that the maximum likelihood estimators with identification
conditions $\Gamma$ diagonal or specified 0's transform in the above fashion;
that is, the transformation $x_\alpha^* = Dx_\alpha$, $\alpha = 1, \ldots, N$, induces $\hat\Lambda^* = D\hat\Lambda$ and
$\hat\Psi^* = D\hat\Psi D$.
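
The effect of a change of units can be checked numerically. The following is a minimal sketch in Python with NumPy; the loadings, factor covariance, error variances, and scale constants are hypothetical values chosen only for illustration. It verifies (8) and the invariance of $\Lambda'\Psi^{-1}\Lambda$ under $X \to DX$.

```python
import numpy as np

# Hypothetical 4-test, 2-factor model; all numbers are illustrative only.
Lam = np.array([[0.9, 0.0],
                [0.7, 0.3],
                [0.2, 0.8],
                [0.0, 0.6]])
Phi = np.array([[1.0, 0.4],
                [0.4, 1.0]])
psi = np.array([0.3, 0.4, 0.5, 0.6])            # diagonal elements of Psi
Sigma = Lam @ Phi @ Lam.T + np.diag(psi)

# Change of units: test i is multiplied by d_i > 0.
d = np.array([2.0, 0.5, 10.0, 1.0])
D = np.diag(d)
Lam_star = D @ Lam                               # loadings scale by d_i
Psi_star = D @ np.diag(psi) @ D                  # error variances scale by d_i**2
Sigma_star = Lam_star @ Phi @ Lam_star.T + Psi_star

# Equation (8): the rescaled model reproduces D Sigma D.
same_structure = np.allclose(Sigma_star, D @ Sigma @ D)

# Lambda' Psi^{-1} Lambda is unchanged by the rescaling.
G = Lam.T @ np.diag(1.0 / psi) @ Lam
G_star = Lam_star.T @ np.linalg.inv(Psi_star) @ Lam_star

# The choice d_i = 1/sqrt(sigma_ii) turns Sigma into a correlation matrix.
d_corr = 1.0 / np.sqrt(np.diag(Sigma))
R = np.diag(d_corr) @ Sigma @ np.diag(d_corr)
```

The last three lines illustrate why an analysis of the correlation matrix is equivalent, under these identification conditions, to an analysis of the covariance matrix.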




14.3. MAXIMUM LIKELIHOOD ESTIMATORS FOR RANDOM 
ORTHOGONAL FACTORS 


14.3.1. Maximum Likelihood Estimators 


In this section we find the maximum likelihood estimators of the parameters
when the observations are normally distributed, that is, the factor scores and
errors are normal [Lawley (1940)]. Then $\Sigma = \Lambda\Phi\Lambda' + \Psi$. We impose conditions
on $\Lambda$ and $\Phi$ to make them just identified. These do not restrict
$\Lambda\Phi\Lambda'$; it is a positive semidefinite matrix of rank $m$. For convenience we
suppose that $\Phi = I$ (i.e., the factors are orthogonal or uncorrelated) and that
$\Gamma = \Lambda'\Psi^{-1}\Lambda$ is diagonal. Then the likelihood depends on the mean $\mu$ and
$\Sigma = \Lambda\Lambda' + \Psi$. The maximum likelihood estimators of $\Lambda$ and $\Phi$ under some
other conditions effecting just identification [e.g., $\Lambda = (I_m, \Lambda_2')'$] are
transformations of the maximum likelihood estimators of $\Lambda$ under the preceding
conditions. If $x_1, \ldots, x_N$ are a set of $N$ observations on $X$, the likelihood
function for this sample is

$$L = (2\pi)^{-pN/2}\,|\Sigma|^{-N/2} \exp\left[-\tfrac{1}{2}\sum_{\alpha=1}^{N}(x_\alpha - \mu)'\Sigma^{-1}(x_\alpha - \mu)\right]. \tag{1}$$


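The maximum likelihood estimator of the mean $\mu$ is $\hat\mu = \bar x = (1/N)\sum_{\alpha=1}^{N} x_\alpha$.
Let

$$A = \sum_{\alpha=1}^{N}(x_\alpha - \bar x)(x_\alpha - \bar x)'. \tag{2}$$

Next we shall maximize the logarithm of (1) with $\mu$ replaced by $\hat\mu$; this is†

$$-\tfrac{1}{2}pN\log 2\pi - \tfrac{1}{2}N\log|\Sigma| - \tfrac{1}{2}\operatorname{tr} A\Sigma^{-1}. \tag{3}$$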


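(This is the logarithm of the concentrated likelihood.) From $\Sigma\Sigma^{-1} = I$, we
obtain for any parameter $\theta$

$$\frac{\partial}{\partial\theta}\Sigma^{-1} = -\Sigma^{-1}\,\frac{\partial\Sigma}{\partial\theta}\,\Sigma^{-1}. \tag{4}$$

Then the partial derivative of (3) with regard to $\psi_{ii}$, a diagonal element of
$\Psi$, is $-N/2$ times

$$\sigma^{ii} - \sum_{k,j=1}^{p} c_{kj}\,\sigma^{ki}\sigma^{ji}, \tag{5}$$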


†We could add the restriction that the off-diagonal elements of $\Lambda'\Psi^{-1}\Lambda$ are 0 with Lagrange
multipliers, but then the Lagrange multipliers become 0 when the derivatives are set equal to 0.
Such restrictions do not affect the maximum.




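where $\Sigma^{-1} = (\sigma^{ij})$ and $(c_{ij}) = C = (1/N)A$. In matrix notation, (5) set equal
to 0 yields

$$\operatorname{diag}\Sigma^{-1} = \operatorname{diag}\Sigma^{-1}C\Sigma^{-1}, \tag{6}$$

where $\operatorname{diag} H$ indicates the diagonal terms of the matrix $H$. Equivalently
$\operatorname{diag}\Sigma^{-1}(\Sigma - C)\Sigma^{-1} = \operatorname{diag} 0$. The derivative of (3) with respect to $\lambda_{k\tau}$ is $-N$
times

$$\sum_{j=1}^{p}\sigma^{kj}\lambda_{j\tau} - \sum_{h,g,j=1}^{p}\sigma^{kh}c_{hg}\sigma^{gj}\lambda_{j\tau}, \qquad k = 1, \ldots, p, \quad \tau = 1, \ldots, m. \tag{7}$$

In matrix notation (7) set equal to 0 yields

$$\Sigma^{-1}\Lambda = \Sigma^{-1}C\Sigma^{-1}\Lambda. \tag{8}$$

We have

$$\Sigma\Psi^{-1}\Lambda = (\Lambda\Lambda' + \Psi)\Psi^{-1}\Lambda = \Lambda\Gamma + \Lambda = \Lambda(\Gamma + I). \tag{9}$$

From this we obtain $\Psi^{-1}\Lambda(\Gamma + I)^{-1} = \Sigma^{-1}\Lambda$. Multiply (8) by $\Sigma$ and use the
above to obtain

$$\Lambda(\Gamma + I) = C\Psi^{-1}\Lambda, \tag{10}$$

or

$$(C - \Psi)\Psi^{-1}\Lambda = \Lambda\Gamma. \tag{11}$$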


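Next we want to show that $\Sigma^{-1} - \Sigma^{-1}C\Sigma^{-1} = \Sigma^{-1}(\Sigma - C)\Sigma^{-1}$ is
$\Psi^{-1}(\Sigma - C)\Psi^{-1}$ when (8) holds. Multiply the latter by $\Sigma$ on the left and on
the right to obtain

$$\Sigma\Psi^{-1}(\Sigma - C)\Psi^{-1}\Sigma = (\Lambda\Lambda' + \Psi)\Psi^{-1}(\Psi + \Lambda\Lambda' - C)\Psi^{-1}(\Lambda\Lambda' + \Psi) = \Psi + \Lambda\Lambda' - C \tag{12}$$

because

$$\begin{aligned} \Lambda\Lambda'\Psi^{-1}(\Psi + \Lambda\Lambda' - C) &= \Lambda\Lambda' + \Lambda\Gamma\Lambda' - \Lambda\Lambda'\Psi^{-1}C \\ &= \Lambda\left[(I + \Gamma)\Lambda' - \Lambda'\Psi^{-1}C\right] \\ &= 0 \end{aligned} \tag{13}$$

by virtue of (10). Thus

$$\Sigma^{-1}(\Sigma - C)\Sigma^{-1} = \Psi^{-1}(\Sigma - C)\Psi^{-1}. \tag{14}$$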




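Then (6) is equivalent to $\operatorname{diag}\Psi^{-1}(\Sigma - C)\Psi^{-1} = \operatorname{diag} 0$. Since $\Psi$ is diagonal,
this equation is equivalent to

$$\operatorname{diag}(\Lambda\Lambda' + \Psi) = \operatorname{diag} C. \tag{15}$$

The estimators $\hat\Lambda$ and $\hat\Psi$ are determined by (10), (15), and the requirement
that $\Lambda'\Psi^{-1}\Lambda$ is diagonal.
We can multiply (11) on the left by $\Psi^{-1/2}$ to obtain

$$\Psi^{-1/2}(C - \Psi)\Psi^{-1/2}\,(\Psi^{-1/2}\Lambda) = (\Psi^{-1/2}\Lambda)\Gamma, \tag{16}$$

which shows that the columns of $\Psi^{-1/2}\Lambda$ are characteristic vectors of
$\Psi^{-1/2}(C - \Psi)\Psi^{-1/2} = \Psi^{-1/2}C\Psi^{-1/2} - I$ and the corresponding diagonal elements
of $\Gamma$ are the characteristic roots. [In fact, the characteristic vectors of
$\Psi^{-1/2}C\Psi^{-1/2} - I$ are the characteristic vectors of $\Psi^{-1/2}C\Psi^{-1/2}$ because
$(\Psi^{-1/2}C\Psi^{-1/2} - I)x = \gamma x$ is equivalent to $\Psi^{-1/2}C\Psi^{-1/2}x = (1 + \gamma)x$.] The vectors
are normalized by $(\Psi^{-1/2}\Lambda)'(\Psi^{-1/2}\Lambda) = \Lambda'\Psi^{-1}\Lambda = \Gamma$. The characteristic
roots are chosen to maximize the likelihood. To evaluate the maximized
likelihood function we calculate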


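$$\begin{aligned} \operatorname{tr} C\hat\Sigma^{-1} &= \operatorname{tr} C\hat\Sigma^{-1}(\hat\Sigma - \hat\Lambda\hat\Lambda')\hat\Psi^{-1} \\ &= \operatorname{tr}\left[C\hat\Psi^{-1} - (C\hat\Sigma^{-1}\hat\Lambda)\hat\Lambda'\hat\Psi^{-1}\right] \\ &= \operatorname{tr}\left[C\hat\Psi^{-1} - \hat\Lambda\hat\Lambda'\hat\Psi^{-1}\right] \\ &= \operatorname{tr}\left[(\hat\Lambda\hat\Lambda' + \hat\Psi)\hat\Psi^{-1} - \hat\Lambda\hat\Lambda'\hat\Psi^{-1}\right] \\ &= p. \end{aligned} \tag{17}$$

The third equality follows from (8) multiplied on the left by $\hat\Sigma$; the fourth
equality follows from (15) and the fact that $\hat\Psi$ is diagonal. Next we find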


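$$\begin{aligned} |\hat\Sigma| &= |\hat\Psi| \cdot |\hat\Psi^{-1/2}\hat\Lambda\hat\Lambda'\hat\Psi^{-1/2} + I_p| \\ &= |\hat\Psi| \cdot |\hat\Lambda'\hat\Psi^{-1}\hat\Lambda + I_m| \\ &= |\hat\Psi| \cdot |\hat\Gamma + I_m| \\ &= \prod_{i=1}^{p}\hat\psi_{ii}\,\prod_{j=1}^{m}(\hat\gamma_j + 1). \end{aligned} \tag{18}$$

The second equality is $|UU' + I_p| = |U'U + I_m|$ for $U$ $p \times m$, which is proved
as in (14) of Section 8.4. From the fact that the characteristic roots of
$\hat\Psi^{-1/2}(C - \hat\Psi)\hat\Psi^{-1/2}$ are the roots $\hat\gamma_1 > \hat\gamma_2 > \cdots > \hat\gamma_p$ of
$0 = |C - \hat\Psi - \gamma\hat\Psi| = |C - (1 + \gamma)\hat\Psi|$,

$$\frac{|C|}{|\hat\Psi|} = \prod_{j=1}^{p}(1 + \hat\gamma_j). \tag{19}$$

[Note that the roots $1 + \hat\gamma_j$ of $\hat\Psi^{-1/2}C\hat\Psi^{-1/2}$ are positive. The roots $\hat\gamma_j$ of
$\hat\Psi^{-1/2}(C - \hat\Psi)\hat\Psi^{-1/2}$ are not necessarily positive; usually some will be negative.]
Then

$$|\hat\Sigma| = |C|\,\frac{\prod_{j \in S}(1 + \hat\gamma_j)}{\prod_{j=1}^{p}(1 + \hat\gamma_j)} = \frac{|C|}{\prod_{j \notin S}(1 + \hat\gamma_j)}, \tag{20}$$

where $S$ is the set of indices corresponding to the roots in $\hat\Gamma$. The logarithm
of the maximized likelihood function is

$$-\tfrac{1}{2}pN\log 2\pi - \tfrac{1}{2}N\log|C| + \tfrac{1}{2}N\sum_{j \notin S}\log(1 + \hat\gamma_j) - \tfrac{1}{2}Np. \tag{21}$$

The largest roots $\hat\gamma_1 > \cdots > \hat\gamma_m$ should be selected for diagonal elements of
$\hat\Gamma$. Then $S = \{1, \ldots, m\}$. The logarithm of the concentrated likelihood (3) is a
function of $\Sigma = \Lambda\Lambda' + \Psi$. This matrix is positive definite for every $\Lambda$ and
every diagonal $\Psi$ that is positive definite; it is also positive definite for some
diagonal $\Psi$'s that are not positive definite. Hence there is not necessarily a
relative maximum for $\Psi$ positive definite. The concentrated likelihood
function may increase as one or more diagonal elements of $\Psi$ approaches 0.
In that case the derivative equations may not be satisfied for $\Psi$ positive
definite.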
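
The role of (16) can be illustrated numerically. The sketch below, in Python with NumPy, treats the diagonal of $\Psi$ as known (an assumption made only for illustration; in practice it must be estimated jointly with $\Lambda$) and recovers $\Lambda$ from the characteristic roots and vectors of $\Psi^{-1/2}(C - \Psi)\Psi^{-1/2}$. The function name `loadings_given_psi` and the numerical values are hypothetical.

```python
import numpy as np

def loadings_given_psi(C, psi, m):
    """Solve (16) for Lambda when the diagonal of Psi is taken as known.

    Returns the p x m loading matrix and all p characteristic roots
    (sorted in descending order) of Psi^{-1/2}(C - Psi)Psi^{-1/2}.
    """
    s = 1.0 / np.sqrt(psi)
    M = (C - np.diag(psi)) * np.outer(s, s)     # Psi^{-1/2}(C - Psi)Psi^{-1/2}
    roots, vecs = np.linalg.eigh(M)             # eigh returns ascending roots
    order = np.argsort(roots)[::-1]
    gamma = roots[order]
    E = vecs[:, order[:m]]                      # unit-length characteristic vectors
    # Columns of Psi^{-1/2} Lambda are characteristic vectors with squared
    # length gamma_j, so Lambda = Psi^{1/2} E Gamma^{1/2}.
    Lam = np.sqrt(psi)[:, None] * E * np.sqrt(np.clip(gamma[:m], 0.0, None))
    return Lam, gamma

# Illustrative population values; C is set equal to Lambda Lambda' + Psi exactly.
Lam_true = np.array([[0.8, 0.1], [0.7, 0.3], [0.5, 0.5], [0.2, 0.7], [0.1, 0.8]])
psi_true = np.array([0.35, 0.42, 0.50, 0.47, 0.35])
C = Lam_true @ Lam_true.T + np.diag(psi_true)

Lam_hat, gamma = loadings_given_psi(C, psi_true, m=2)
G = Lam_hat.T @ (Lam_hat / psi_true[:, None])   # Lambda' Psi^{-1} Lambda
```

Because `Lam_true` does not itself satisfy the diagonality condition, `Lam_hat` is a rotation of it; the two agree in $\Lambda\Lambda'$, and `G` is diagonal with the $m$ largest roots on its diagonal.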

The equations for the estimators (11) and (15) can be written as polynomial
equations [multiplying (11) by $|\Psi|$], but cannot be solved directly. There
are various iterative procedures for finding a maximum of the likelihood
function, including steepest descent, Newton-Raphson, scoring (using the
information matrix), and Fletcher-Powell. [See Lawley and Maxwell (1971),
Appendix II, for a discussion.]

Since there may not be a relative maximum in the region for which $\psi_{ii} > 0$,
$i = 1, \ldots, p$, an iterative procedure may define a sequence of values of $\hat\Lambda$ and
$\hat\Psi$ that includes $\hat\psi_{ii} < 0$ for some indices $i$. Such negative values are inadmissible
because $\psi_{ii}$ is interpreted as the variance of an error. One may impose
the condition that $\psi_{ii} \ge 0$, $i = 1, \ldots, p$. Then the maximum may occur on the
boundary (and not all of the derivative equations will be satisfied). For some
indices $i$ the estimated variance of the error is 0; that is, some test scores are
exactly linear combinations of factor scores. If the identification conditions
$\Phi = I$ and $\Lambda'\Psi^{-1}\Lambda$ diagonal are dropped, we can find a coordinate system
for the factors such that the test scores with 0 error variance can be
interpreted as (transformed) factor scores. That interpretation does not seem
useful. [See Lawley and Maxwell (1971) for further discussion.]

An alternative to requiring $\psi_{ii}$ to be positive is to require $\psi_{ii}$ to be
bounded away from 0. A possibility is $\psi_{ii} \ge \varepsilon\sigma_{ii}$ for some small $\varepsilon$, such as
0.005. Of course, the value of $\varepsilon$ is arbitrary; increasing $\varepsilon$ will decrease the
value of the maximum if the maximum is not in the interior of the restricted
region, and the derivative equations will not all be satisfied.

The nature of the concentrated likelihood is such that more than one 
relative maximum may be possible. Which maximum an iterative procedure 
approaches will depend on the initial values. Rubin and Thayer (1982) have
given an example of three sets of estimates from three different initial 
estimates using the EM algorithm. 

The EM (expectation—maximization) algorithm is a possible computational 
device for maximum likelihood estimation [Dempster, Laird, and Rubin 
(1977), Rubin and Thayer (1982)]. The idea is to treat the unobservable f’s as 
missing data. Under the assumption that f and U have a joint normal 
distribution, the sufficient statistics are the means and covariances of the X’s 
and f’s. The E-step of the algorithm is to obtain the expectation of the 
covariances on the basis of trial values of the parameters. The M-step is to 
maximize the likelihood function on the basis of these covariances; this step 
provides updated values of the parameters. The steps alternate, and the 
procedure usually converges to the maximum likelihood estimators. (See 
Problem 14.3.) 
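
The alternation of the two steps can be sketched as follows in Python with NumPy. This is one common way to write the updates for the orthogonal-factor model $\Sigma = \Lambda\Lambda' + \Psi$ with the mean already removed, not necessarily the exact formulation of the cited papers; the function names, starting values, and the matrix `C` are illustrative assumptions.

```python
import numpy as np

def objective(C, Lam, psi):
    """-log|Sigma| - tr(C Sigma^{-1}): the part of (3)/(N/2) depending on Lambda, Psi."""
    Sigma = Lam @ Lam.T + np.diag(psi)
    _, logdet = np.linalg.slogdet(Sigma)
    return -logdet - np.trace(np.linalg.solve(Sigma, C))

def em_step(C, Lam, psi):
    """One E-step and M-step with C as the sample covariance matrix."""
    m = Lam.shape[1]
    Sigma = Lam @ Lam.T + np.diag(psi)
    beta = np.linalg.solve(Sigma, Lam).T             # Lambda' Sigma^{-1}, m x p
    # E-step: expected cross- and factor covariances given trial parameters.
    C_xf = C @ beta.T                                # p x m
    C_ff = beta @ C_xf + np.eye(m) - beta @ Lam      # m x m
    # M-step: regression of X on f, then residual variances.
    Lam_new = C_xf @ np.linalg.inv(C_ff)
    psi_new = np.diag(C - Lam_new @ C_xf.T).copy()
    return Lam_new, psi_new

# Illustrative "sample" covariance built from hypothetical parameters.
Lam_true = np.array([[0.8, 0.0], [0.7, 0.2], [0.6, 0.3],
                     [0.1, 0.7], [0.0, 0.9], [0.2, 0.6]])
psi_true = np.array([0.36, 0.47, 0.55, 0.50, 0.19, 0.60])
C = Lam_true @ Lam_true.T + np.diag(psi_true)

Lam = np.array([[0.5, 0.1], [0.5, -0.1], [0.5, 0.2],
                [0.5, -0.2], [0.5, 0.3], [0.5, -0.3]])
psi = np.full(6, 0.5)
history = [objective(C, Lam, psi)]
for _ in range(200):
    Lam, psi = em_step(C, Lam, psi)
    history.append(objective(C, Lam, psi))
```

The recorded `history` should never decrease, which is the defining property of EM; the limit reached can depend on the starting values, as the example of Rubin and Thayer (1982) shows. The resulting $\Lambda$ is determined only up to rotation, so the diagonality condition would have to be imposed afterward.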

As noted in Section 14.2, the structure is equivariant and the factor scores
are invariant under changes in the units of measurement of the observed
variables $X \to DX$, where $D$ is a diagonal matrix with positive diagonal
elements and $\Lambda$ is identified by the condition that $\Lambda'\Psi^{-1}\Lambda$ is diagonal. If we let
$D\Lambda = \Lambda^*$, $D\Psi D = \Psi^*$, and $DCD = C^*$, then the logarithm of the likelihood function is
a constant plus a constant times

$$\begin{aligned} &-\log|\Psi^* + \Lambda^*\Lambda^{*\prime}| - \operatorname{tr} C^*(\Psi^* + \Lambda^*\Lambda^{*\prime})^{-1} \\ &\qquad = -\log|\Psi + \Lambda\Lambda'| - \operatorname{tr} C(\Psi + \Lambda\Lambda')^{-1} - 2\log|D|. \end{aligned} \tag{22}$$

The maximum likelihood estimators of $\Lambda^*$ and $\Psi^*$ are $\hat\Lambda^* = D\hat\Lambda$ and
$\hat\Psi^* = D\hat\Psi D$, and $\hat\Lambda^{*\prime}\hat\Psi^{*-1}\hat\Lambda^* = \hat\Lambda'\hat\Psi^{-1}\hat\Lambda$ is diagonal. That is, the estimated
factor loadings and error variances are merely changed by the units of
measurement.

It is often convenient to use $d_{ii} = 1/\sqrt{c_{ii}}$, so $DCD = (r_{ij})$ is made up of
the sample correlation coefficients. The analysis is independent of the units
of measurement. This fact is related to the fact that psychological test scores
do not have natural units.


The fact that the factors do not depend on the location and scale factors is
one reason for considering factor analysis as an analysis of interdependence.
It is convenient to give some rules of thumb for initial estimates of the
communalities, $\sum_{j=1}^{m}\hat\lambda_{ij}^2 = 1 - \hat\psi_{ii}$, in terms of observed correlations. One rule
is to use the $R^2_{i \cdot 1, \ldots, i-1, i+1, \ldots, p}$. Another is to use $\max_{h \ne i}|r_{ih}|$.


14.3.2. Test of the Hypothesis That the Model Fits 


We shall derive the likelihood ratio test that the model fits; that is, that for a
specified $m$ the covariance matrix can be written as $\Sigma = \Psi + \Lambda\Lambda'$ for some
diagonal positive definite $\Psi$ and some $p \times m$ matrix $\Lambda$. The likelihood ratio
criterion is

$$\frac{\max_{\mu,\Psi,\Lambda} L(\mu, \Psi + \Lambda\Lambda')}{\max_{\mu,\Sigma} L(\mu, \Sigma)} = \frac{|C|^{N/2}}{|\hat\Psi + \hat\Lambda\hat\Lambda'|^{N/2}} = \prod_{j=m+1}^{p}(1 + \hat\gamma_j)^{N/2} \tag{23}$$

because the unrestricted maximum likelihood estimator of $\Sigma$ is $C$,
$\operatorname{tr} C(\hat\Psi + \hat\Lambda\hat\Lambda')^{-1} = p$ by (17), and $|C|/|\hat\Sigma| = \prod_{j=m+1}^{p}(1 + \hat\gamma_j)$ from (20). The null
hypothesis is rejected if (23) is too small. We can use $-2$ times the logarithm
of the likelihood ratio criterion:

$$-N\sum_{j=m+1}^{p}\log(1 + \hat\gamma_j) \tag{24}$$

and reject the null hypothesis if (24) is too large.

If the regularity conditions for $\hat\Psi$ and $\hat\Lambda$ to be asymptotically normally
distributed hold, the limiting distribution of (24) under the null hypothesis is
$\chi^2$ with degrees of freedom $\tfrac{1}{2}[(p - m)^2 - p - m]$, which is the number of
elements of $\Sigma$ plus the number of identifying restrictions minus the number
of parameters in $\Psi$ and $\Lambda$. Bartlett (1950) suggested replacing $N$ by†
$N - (2p + 11)/6 - 2m/3$. See also Amemiya and Anderson (1990).
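
The criterion (24), its degrees of freedom, and Bartlett's multiplier are simple to compute once the roots $\hat\gamma_j$ are available. A minimal sketch in plain Python follows; the function name `factor_model_test` and the example roots are hypothetical, and the roots are chosen so that the trailing ones sum to 0, as (25) requires of a maximum likelihood solution.

```python
import math

def factor_model_test(gammas, N, p, m, bartlett=True):
    """Return (statistic, degrees_of_freedom) for the test that m factors fit.

    gammas: all p characteristic roots of Psi^{-1/2}(C - Psi)Psi^{-1/2}
    evaluated at the maximum likelihood estimate of Psi.
    """
    trailing = sorted(gammas, reverse=True)[m:]          # the p - m smallest roots
    # Bartlett's heuristic multiplier replaces N when requested.
    scale = N - (2 * p + 11) / 6 - 2 * m / 3 if bartlett else N
    stat = -scale * sum(math.log1p(g) for g in trailing)  # equation (24)
    df = ((p - m) ** 2 - p - m) // 2                      # always an integer
    return stat, df

# Hypothetical roots for p = 6 tests and m = 2 factors.
gammas = [2.1, 1.3, 0.10, 0.02, -0.04, -0.08]
stat, df = factor_model_test(gammas, N=200, p=6, m=2)
```

The statistic would then be referred to a $\chi^2$ table with `df` degrees of freedom; when `df` is not positive, no nontrivial test is available for that $m$.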

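From (15) and the fact that $\hat\gamma_1, \ldots, \hat\gamma_p$ are the characteristic roots of
$\hat\Psi^{-1/2}(C - \hat\Psi)\hat\Psi^{-1/2}$ we have

$$\begin{aligned} 0 &= \operatorname{tr}\hat\Psi^{-1/2}(C - \hat\Psi - \hat\Lambda\hat\Lambda')\hat\Psi^{-1/2} \\ &= \operatorname{tr}\hat\Psi^{-1/2}(C - \hat\Psi)\hat\Psi^{-1/2} - \operatorname{tr}\hat\Psi^{-1/2}\hat\Lambda\hat\Lambda'\hat\Psi^{-1/2} \\ &= \operatorname{tr}\hat\Psi^{-1/2}(C - \hat\Psi)\hat\Psi^{-1/2} - \operatorname{tr}\hat\Gamma \\ &= \sum_{j=1}^{p}\hat\gamma_j - \sum_{j=1}^{m}\hat\gamma_j = \sum_{j=m+1}^{p}\hat\gamma_j. \end{aligned} \tag{25}$$

†This factor is heuristic. If $m = 0$, the factor from Chapter 9 is $N - (2p + 11)/6$; Bartlett
suggested replacing $N$ and $p$ by $N - m$ and $p - m$, respectively.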




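If $|\hat\gamma_j| < 1$ for $j = m + 1, \ldots, p$, we can expand (24) using (25) as

$$-N\sum_{j=m+1}^{p}\left(\hat\gamma_j - \tfrac{1}{2}\hat\gamma_j^2 + \tfrac{1}{3}\hat\gamma_j^3 - \cdots\right) = \tfrac{1}{2}N\sum_{j=m+1}^{p}\left(\hat\gamma_j^2 - \tfrac{2}{3}\hat\gamma_j^3 + \cdots\right). \tag{26}$$

The criterion is approximately $\tfrac{1}{2}N\sum_{j=m+1}^{p}\hat\gamma_j^2$. The estimators $\hat\Psi$ and $\hat\Lambda$ are
found so that $C - \hat\Psi - \hat\Lambda\hat\Lambda'$ is small in a statistical sense or, equivalently,
so $C - \hat\Psi$ is approximately of rank $m$. Then the smallest $p - m$ roots of
$\hat\Psi^{-1/2}(C - \hat\Psi)\hat\Psi^{-1/2}$ should be near 0. The criterion measures the deviations
of these roots from 0. Since $\hat\gamma_{m+1}, \ldots, \hat\gamma_p$ are the nonzero roots of
$\hat\Psi^{-1/2}(C - \hat\Sigma)\hat\Psi^{-1/2}$, we see that

$$\begin{aligned} \tfrac{1}{2}\sum_{j=m+1}^{p}\hat\gamma_j^2 &= \tfrac{1}{2}\operatorname{tr}\hat\Psi^{-1/2}(C - \hat\Sigma)\hat\Psi^{-1}(C - \hat\Sigma)\hat\Psi^{-1/2} \\ &= \tfrac{1}{2}\operatorname{tr}\hat\Psi^{-1}(C - \hat\Sigma)\hat\Psi^{-1}(C - \hat\Sigma) \\ &= \sum_{i<j}\frac{(c_{ij} - \hat\sigma_{ij})^2}{\hat\psi_{ii}\hat\psi_{jj}}, \end{aligned} \tag{27}$$

because the diagonal elements of $C - \hat\Sigma$ are 0.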

In many situations the investigator does not know a value of $m$ to
hypothesize. He or she wants to determine the smallest number of factors
such that the model is consistent with the data. It is customary to test
successive values of $m$. The investigator starts with a test that the number of
factors is a specified $m_0$ (possibly 0 or 1). If that hypothesis is rejected, one
proceeds to test that the number is $m_0 + 1$. One continues in that fashion
until a hypothesis is accepted or until $\tfrac{1}{2}[(p - m)^2 - p - m] \le 0$. In the last
event one concludes that no nontrivial factor model fits. Unfortunately,
the probabilities of errors under this procedure are unknown, even asymptotically.


14.3.3. Asymptotic Distributions of the Estimators 


The maximum likelihood estimators $\hat\Lambda$ and $\hat\Psi$ maximize the average concentrated
log likelihood function $L^*(C, \Lambda^*, \Psi^*)$ given by (3) divided by $N$ for
$\Sigma^* = \Psi^* + \Lambda^*\Lambda^{*\prime}$, subject to $\Lambda^{*\prime}\Psi^{*-1}\Lambda^*$ being diagonal. If $C$ is a consistent
estimator of $\Sigma$ (the "true" covariance matrix), then $L^*(C, \Lambda^*, \Psi^*) \to
L^*(\Psi + \Lambda\Lambda', \Lambda^*, \Psi^*)$ uniformly in probability in a neighborhood of $\Lambda$, $\Psi$,
and $L^*(\Psi + \Lambda\Lambda', \Lambda^*, \Psi^*)$ has a unique maximum at $\Psi^* = \Psi$ and $\Lambda^* = \Lambda$.
Because the function is continuous, the $\Lambda^*$, $\Psi^*$ that maximize
$L^*(C, \Lambda^*, \Psi^*)$ must converge stochastically to $\Lambda$, $\Psi$.




Theorem 14.3.1. If $\Lambda$ and $\Psi$ are identified by $\Lambda'\Psi^{-1}\Lambda$ being diagonal, if
the diagonal elements are different and ordered, and if $C \xrightarrow{p} \Psi + \Lambda\Lambda'$, then
$\hat\Psi \xrightarrow{p} \Psi$ and $\hat\Lambda \xrightarrow{p} \Lambda$.


A sufficient condition for $C \xrightarrow{p} \Sigma$ is that $(f'\ U')'$ has a distribution with
finite second-order moments.

The estimators $\hat\Lambda$ and $\hat\Psi$ are the solutions to the equations (10), (15), and
the requirement that $\Lambda'\Psi^{-1}\Lambda$ is diagonal. These equations are polynomial
equations. The derivatives of $\hat\Lambda$ and $\hat\Psi$ as functions of $C$ are continuous
unless they become infinite. Anderson and Rubin (1956) investigated conditions
for the derivative to be finite and proved the following theorem:


Theorem 14.3.2. Let

$$(\theta_{ij}) = \Theta = \Psi - \Lambda(\Lambda'\Psi^{-1}\Lambda)^{-1}\Lambda'. \tag{28}$$

If $(\theta_{ij}^2)$ is nonsingular, if $\Lambda$ and $\Psi$ are identified by the condition that $\Lambda'\Psi^{-1}\Lambda$
is diagonal and the diagonal elements are different and ordered, if $C \xrightarrow{p} \Psi + \Lambda\Lambda'$,
and if $\sqrt{N}(C - \Sigma)$ has a limiting normal distribution, then $\sqrt{N}(\hat\Lambda - \Lambda)$ and
$\sqrt{N}(\hat\Psi - \Psi)$ have a limiting normal distribution.


For example, $\sqrt{N}(C - \Sigma)$ will have a limiting distribution if $(f'\ U')'$ has a
distribution with finite fourth moments.

The covariance matrix of the limiting distribution of $\sqrt{N}(\hat\Lambda - \Lambda)$ and
$\sqrt{N}(\hat\Psi - \Psi)$ is too complicated to derive or even present here. Lawley (1953)
found covariances for $\sqrt{N}(\hat\Lambda - \Lambda)$ appropriate for $\Psi$ known, and Lawley
(1967) extended his work to the case of $\Psi$ estimated. [See also Lawley and
Maxwell (1971).] Jennrich and Thayer (1973) corrected an error in his work.

The covariance of $\sqrt{N}(\hat\psi_{ii} - \psi_{ii})$ and $\sqrt{N}(\hat\psi_{jj} - \psi_{jj})$ in the limiting distribution
is

$$2\psi_{ii}^2\psi_{jj}^2\,\xi^{ij}, \qquad i, j = 1, \ldots, p, \tag{29}$$

where $(\xi^{ij}) = (\theta_{ij}^2)^{-1}$. The other covariances are too involved to give here.

While the asymptotic covariances are too complicated to give insight into
the sampling variability, they can be programmed for computation. In that
case the parameters are replaced by their consistent estimators.


14.3.4. Minimum-Distance Methods


An alternative to maximum likelihood is generalized least squares. The
estimators are the values of $\Psi$ and $\Lambda$ that minimize

$$\operatorname{tr}(C - \Sigma)H(C - \Sigma)H, \tag{30}$$

where $\Sigma = \Psi + \Lambda\Lambda'$ and $H = \Sigma^{-1}$ or some consistent estimator of $\Sigma^{-1}$.
When $H = \Sigma^{-1}$, the objective function is of the form

$$[c - \sigma(\Psi, \Lambda)]'[\operatorname{cov} c]^{-1}[c - \sigma(\Psi, \Lambda)], \tag{31}$$

where $c$ represents the elements of $C$ arranged in a vector, $\sigma(\Psi, \Lambda)$ is
$\Psi + \Lambda\Lambda'$ arranged in a corresponding vector, and $\operatorname{cov} c$ is the covariance
matrix of $c$ under normality [Anderson (1973a)]. Jöreskog and Goldberger
(1972) use $C^{-1}$ for $H$ and minimize

$$\tfrac{1}{2}\operatorname{tr}(C - \Sigma)C^{-1}(C - \Sigma)C^{-1} = \tfrac{1}{2}\operatorname{tr}(I - \Sigma C^{-1})^2. \tag{32}$$

The matrix of derivatives with respect to the elements of $\Lambda$ set equal to 0
forms the matrix equation

$$C^{-1}(C - \Sigma)C^{-1}\Lambda = 0. \tag{33}$$

This can be rewritten as

$$\Lambda = \Sigma C^{-1}\Lambda. \tag{34}$$

Multiplication on the left by $\Sigma^{-1}C\Sigma^{-1}$ yields (8), which leads to (10). This
estimator of $\Lambda$ given $\Psi$ is the same as the maximum likelihood estimator
except for normalization of columns. The equation obtained by setting the
derivatives of (32) with respect to $\Psi$ equal to 0 is

$$\operatorname{diag} C^{-1}[(\Psi + \Lambda\Lambda') - C]C^{-1} = \operatorname{diag} 0. \tag{35}$$

An alternative is to minimize

$$\tfrac{1}{2}\operatorname{tr}\left\{(\Psi + \Lambda\Lambda')^{-1}[C - (\Psi + \Lambda\Lambda')]\right\}^2. \tag{36}$$

This leads to (8) or (10) and

$$\operatorname{diag}\Sigma^{-1}C\Sigma^{-1}(C - \Sigma)\Sigma^{-1} = \operatorname{diag} 0. \tag{37}$$

Browne (1974) showed that the generalized least squares estimator of $\Psi$ has
the same asymptotic distribution as the maximum likelihood estimator. Dahm
and Fuller (1981) showed that if $\operatorname{cov} c$ in (31) is replaced by a matrix
converging to $\operatorname{cov} c$ and $\Psi$, $\Lambda$, and $\Phi$ depend on some parameters, then the
asymptotic distributions are the same as for maximum likelihood.


14.3.5. Relation to Principal Component Analysis 


What is the relation of maximum likelihood to the principal component
analysis proposed by Hotelling (1933)? As explained in Chapter 11, the vector
of sample principal components is the orthogonal transformation $B'X$, where
the columns of $B$ are the characteristic vectors of $C$ normalized by $B'B = I$.
Then

$$C = BTB' = \sum_{i=1}^{p} b_i t_i b_i', \tag{38}$$

where $T$ is the diagonal matrix with diagonal elements $t_1 \ge t_2 \ge \cdots \ge t_p$, the
characteristic roots of $C$. If $t_{m+1}, \ldots, t_p$ are small, $C$ can be approximated by

$$B_1T_1B_1' = \sum_{i=1}^{m} b_i t_i b_i', \tag{39}$$

where $T_1$ is the diagonal matrix with diagonal elements $t_1, \ldots, t_m$, and $X$ is
approximated by

$$B_1B_1'X = \sum_{i=1}^{m} b_i(b_i'X). \tag{40}$$

Then the sample covariance of the difference between $X$ and the approximation
(40) is the sample covariance of

$$X - B_1B_1'X = B_2B_2'X = \sum_{i=m+1}^{p} b_i(b_i'X), \tag{41}$$

which is $B_2T_2B_2' = \sum_{i=m+1}^{p} b_i t_i b_i'$, and the sum of the variances of the components
is $\sum_{i=m+1}^{p} t_i$. Here $T_2$ is the diagonal matrix with $t_{m+1}, \ldots, t_p$ as
diagonal elements.
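
The decomposition (38) to (41) can be verified numerically. A minimal sketch in Python with NumPy follows; the matrix `C` is a hypothetical sample covariance matrix used only for illustration.

```python
import numpy as np

# Hypothetical 3 x 3 sample covariance matrix.
C = np.array([[4.0, 2.0, 0.6],
              [2.0, 3.0, 0.4],
              [0.6, 0.4, 1.0]])

t, B = np.linalg.eigh(C)                 # ascending roots, orthonormal vectors
order = np.argsort(t)[::-1]
t, B = t[order], B[:, order]             # now t_1 >= t_2 >= t_3

m = 1
B1, B2 = B[:, :m], B[:, m:]
approx = B1 @ np.diag(t[:m]) @ B1.T      # rank-m approximation (39)

# Covariance of X - B1 B1' X is (I - B1 B1') C (I - B1 B1')'.
P_res = np.eye(3) - B1 @ B1.T
residual_cov = P_res @ C @ P_res.T
```

Here `residual_cov` equals $B_2T_2B_2'$, and its trace is the sum of the discarded roots, which is the sense in which the first $m$ components "explain" the remainder of $\operatorname{tr} C$.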

This analysis is in terms of some common unit of measurement. The first
$m$ components "explain" a large proportion of the "variance," $\operatorname{tr} C$. When the
units of measurement are not the same (e.g., when the units are arbitrary),
it is customary to standardize each measurement to (sample) variance 1.
However, then the principal components do not have the interpretation in
terms of variance.

Another difference between principal component analysis and factor analysis
is that the former does not separate the error from the systematic part.
This fault is easily remedied, however. Thomson (1934) proposed the following
estimation procedure for the factor analysis model. A diagonal matrix $\Psi$
is subtracted from $C$, and the principal component analysis is carried out on
$C - \Psi$. However, $\Psi$ is determined so $C - \Psi$ is close to rank $m$. The
equations are

$$(C - \Psi)\Lambda = \Lambda L, \tag{42}$$

$$\operatorname{diag}(\Psi + \Lambda\Lambda') = \operatorname{diag} C, \tag{43}$$

$$\Lambda'\Lambda = L \text{ diagonal.} \tag{44}$$

The last equation is a normalization and takes out the indeterminacy in $\Lambda$.
This method allows for the error terms, but still depends on the units of
measurement. The estimators are consistent but not (asymptotically) efficient
in the usual factor analysis model.


14.3.6. The Centroid Method 


Before the availability of high-speed computers, the centroid method was
used almost exclusively because of its computational ease. For the sake of
history we give a sketch of the method. Let $R^*$ be the reduced correlation
matrix, that is, the matrix consisting of $r_{ij}$, $i \ne j$, and $1 - \hat\psi_{ii}^*$, where $\hat\psi_{ii}^*$ is an
initial estimate of the error variance in standard deviation units. Thomson's
principal components approach is first to find the $m$ characteristic vectors of
$R_0 = R^*$ corresponding to the $m$ largest characteristic roots. As indicated in
Chapter 11, one computational method involves starting with an initial
estimate of the first vector, say $x^{(0)}$, calculating $x^{(1)} = R_0x^{(0)}$, and iterating. At
the $r$th step $x^{(r)}$ is approximately $\gamma_1x^{(r-1)}$, where $\gamma_1$ is the largest root, and
$x^{(r)\prime}x^{(r)}$ is approximately $\gamma_1x^{(r-1)\prime}x^{(r)}$. Then $y_1 = x^{(r)}/\sqrt{x^{(r-1)\prime}x^{(r)}}$ is approximately
the first characteristic vector normalized so $y_1'y_1 = \gamma_1$. To obtain the second
vector, apply the same procedure to $R_1 = R^* - y_1y_1'$.

The centroid method can be considered as a very rough approximation to
the principal component approach. With psychological tests the correlation
matrix usually consists of positive entries, and the first characteristic vector
has all positive components, often of about the same value. The centroid
method uses $\varepsilon = (1, \ldots, 1)'$ as the initial estimate of the first vector. Then
$R^*\varepsilon = x^{(1)}$ is the first iterate and should be an approximation to the first
characteristic vector. An approximation to the first characteristic root is
$\varepsilon'R^*\varepsilon/\varepsilon'\varepsilon$. Then $y_1 = x^{(1)}/\sqrt{\varepsilon'R^*\varepsilon}$ is an approximation to the first characteristic
vector of $R^*$ normalized to have length squared $\gamma_1$. The operations
can be carried out on an adding machine or on a desk calculator because
$R^*\varepsilon$ amounts to adding across rows and $\varepsilon'R^*\varepsilon$ is the sum of those row
totals.

The second characteristic vector is orthogonal to the first. A vector
orthogonal to $\varepsilon$ is $\varepsilon^*$ consisting of $p/2$ 1's and $p/2$ $-1$'s. Then $R_1\varepsilon^* = x_2$ is
an approximation to the second characteristic vector, and $\varepsilon^{*\prime}R_1\varepsilon^*/\varepsilon^{*\prime}\varepsilon^*$
approximates the second characteristic root. These operations involve changing
signs of entries of $R_1$ and adding. The positions of the $-1$'s in $\varepsilon^*$ are
selected to maximize $\varepsilon^{*\prime}R_1\varepsilon^*$. The procedure can be continued.
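
The first step of the centroid method amounts to row sums and one square root. A minimal sketch in Python with NumPy follows; the function name `first_centroid_factor` and the reduced correlation matrix are hypothetical. The example is deliberately equicorrelated, so the vector of 1's is an exact characteristic vector and the centroid step is exact rather than approximate.

```python
import numpy as np

def first_centroid_factor(R_star):
    """First centroid loadings y_1 and approximate first root of R*."""
    eps = np.ones(R_star.shape[0])
    x1 = R_star @ eps              # row sums of R*
    total = eps @ x1               # sum of all entries, eps' R* eps
    root = total / len(eps)        # eps' R* eps / eps' eps
    y1 = x1 / np.sqrt(total)       # normalized so y1'y1 approximates the root
    return y1, root

# Hypothetical reduced correlation matrix: communalities 0.5 on the diagonal.
R_star = np.array([[0.5, 0.4, 0.4],
                   [0.4, 0.5, 0.4],
                   [0.4, 0.4, 0.5]])
y1, root = first_centroid_factor(R_star)
R1 = R_star - np.outer(y1, y1)     # residual matrix for the second factor
```

For a less regular matrix the result is only a rough approximation to the first characteristic vector, which is the point made in the text.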


14.4. ESTIMATION FOR FIXED FACTORS 


Let $x_\alpha = (x_{1\alpha}, \ldots, x_{p\alpha})'$ be an observation on $X_\alpha$ given by

$$X_\alpha = \Lambda f_\alpha + \mu + U_\alpha, \tag{1}$$

with $f_\alpha$ being a nonstochastic vector (an incidental parameter), $\alpha = 1, \ldots, N$,
satisfying $\sum_{\alpha=1}^{N} f_\alpha = 0$. The likelihood function is

$$L = \frac{1}{(2\pi)^{pN/2}\prod_{i=1}^{p}\psi_{ii}^{N/2}} \exp\left[-\frac{1}{2}\sum_{i=1}^{p}\sum_{\alpha=1}^{N}\frac{\left(x_{i\alpha} - \mu_i - \sum_{j=1}^{m}\lambda_{ij}f_{j\alpha}\right)^2}{\psi_{ii}}\right]. \tag{2}$$

This likelihood function does not have a maximum. To show this fact, let
$\mu_1 = 0$, $\lambda_{11} = 1$, $\lambda_{1j} = 0$ $(j \ne 1)$, $f_{1\alpha} = x_{1\alpha}$. Then $x_{1\alpha} - \mu_1 - \sum_{j=1}^{m}\lambda_{1j}f_{j\alpha} = 0$
and $\psi_{11}$ does not appear in the exponent but appears only in the constant.
As $\psi_{11} \to 0$, $L \to \infty$. Thus the likelihood does not have a maximum, and the
maximum likelihood estimators do not exist [Anderson and Rubin (1956)].
Lawley (1941) set the partial derivatives of the likelihood equal to 0, but
Solari (1969) showed that the solution is only a stationary value, not a
maximum.
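
The unboundedness is easy to see numerically. The sketch below, in plain Python, evaluates the logarithm of (2) for a small hypothetical data set with $p = 2$, $m = 1$, $N = 4$, using the parameter choice described above (the first row of data sums to 0 so that $\sum_\alpha f_\alpha = 0$ holds); the function name and all numbers are illustrative assumptions.

```python
import math

def fixed_factor_loglik(x, mu, Lam, f, psi):
    """Log of likelihood (2); x is p x N, f is m x N, psi holds the p error variances."""
    p, N, m = len(x), len(x[0]), len(f)
    total = -0.5 * p * N * math.log(2 * math.pi)
    for i in range(p):
        rss = 0.0
        for a in range(N):
            fitted = mu[i] + sum(Lam[i][j] * f[j][a] for j in range(m))
            rss += (x[i][a] - fitted) ** 2
        total += -0.5 * N * math.log(psi[i]) - 0.5 * rss / psi[i]
    return total

# Hypothetical data; the first test's scores sum to zero.
x = [[1.0, -0.5, 0.2, -0.7],
     [0.3, 1.1, -0.4, 0.9]]
mu = [0.0, 0.475]
Lam = [[1.0], [0.5]]
f = [x[0]]                       # factor scores set equal to the first test

values = [fixed_factor_loglik(x, mu, Lam, f, [psi11, 1.0])
          for psi11 in (1e-1, 1e-3, 1e-6)]
```

The first test is fitted exactly, so each hundredfold reduction of $\psi_{11}$ adds $\tfrac{1}{2}N\log 100$ to the log likelihood, without limit.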

Since maximum likelihood estimators do not exist in the case of fixed
factors, what estimation methods can be used? One possibility is to use the
maximum likelihood method appropriate for random factors. It was stated by
Anderson and Rubin (1956) and proved by Fuller, Pantula, and Amemiya
(1982) in the case of identification by 0's that the asymptotic normal distribution
of the maximum likelihood estimators for the random case is the same as
for fixed factors.

The sample covariance matrix under normality has the noncentral Wishart
distribution [Anderson (1946a)] depending on $\Psi$, $\Lambda\Phi\Lambda'$, and $N - 1$. Anderson
and Rubin (1956) proposed maximizing this likelihood function. However,
one of the equations is difficult to solve. Again the estimators are
asymptotically equivalent to the maximum likelihood estimators for the
random-factor case.




14.5. FACTOR INTERPRETATION AND TRANSFORMATION 


14.5.1. Interpretation 


The identification restrictions of $\Lambda'\Psi^{-1}\Lambda$ diagonal or the first $m$ rows of $\Lambda$
being $I_m$ may be convenient for computing the maximum likelihood estimators,
but the components of the factor score vector may not have any intrinsic
meaning. We saw in Section 14.2 that 0 coefficients may give meaning to a
factor by the fact that this factor does not affect certain tests. Similarly, large
factor loadings may help in interpreting a factor. The coefficient of verbal
ability, for example, should be large on tests that look like they are verbal.

In psychology each variable or factor usually has a natural positive direction:
more answers right on a test and more of the ability represented by the
factor. It is usually expected that more ability leads to higher performance;
that is, the factor loading should be positive if it is not 0. Therefore, roughly
speaking, for the sake of interpretation, one may look for factor loadings that
are either 0 or positive and large.

Figure 14.1. Rows of $\hat\Lambda$. [Plot of the five points $(\hat\lambda_{i1}, \hat\lambda_{i2})$, $i = 1, \ldots, 5$, against the two factor axes.]


14.5.2. Transformations


The maximum likelihood estimators on the basis of some arbitrary identification
conditions including $\Phi = I$ are $\hat\Lambda$ and $\hat\Psi$. We consider transformations

$$\hat\Lambda^* = \hat\Lambda P, \qquad \hat\Phi^* = P^{-1}(P^{-1})' = (P'P)^{-1}. \tag{1}$$

If the factors are to be orthogonal, then $\hat\Phi^* = I$ and $P$ is orthogonal. If the
factors are permitted to be oblique, $P$ can be an arbitrary nonsingular matrix
and $\hat\Phi^*$ an arbitrary positive definite matrix.

The rows of $\hat\Lambda$ can be plotted in an $m$-dimensional space. Figure 14.1 is a
plot of the rows of a $5 \times 2$ matrix $\hat\Lambda$. The coordinates refer to factors and the
points refer to tests. If $\hat\Phi^*$ is required to be $I_m$, we are seeking a rotation of
coordinate axes in this space. In the example that is graphed, a rotation of
45° would put all of the points into the positive quadrant, that is, $\hat\lambda_{ij}^* \ge 0$. One
of the new coordinates would be large for each of the first three points and
small for the other two points, and the other coordinate would be small for
the first three and large for the last two. The first factor is representative of
what is common to the first three tests, and the second factor of what is
common to the last two tests.

If $m > 2$, a general rotation can be approximated manually by a sequence
of two-dimensional rotations.


If $\hat\Phi^*$ is not required to be $I_m$, the transformation $P$ is simply nonsingular.
If the normalization of the $j$th column of $\Lambda$ is $\lambda_{i(j),j} = 1$, then

$$1 = \hat\lambda^*_{i(j),j} = \sum_{k=1}^{m}\hat\lambda_{i(j),k}\,p_{kj}; \tag{2}$$

each column of $P$ satisfies such a constraint. If the normalization is $\phi_{jj} = 1$,
then

$$1 = \hat\phi^*_{jj} = \sum_{k=1}^{m}(p^{jk})^2, \tag{3}$$

where $(p^{jk}) = P^{-1}$.

Of the various computational procedures that are based on optimizing an
objective function, we describe the varimax method proposed by Kaiser
(1958) to be carried out on pairs of factors. Horst (1965), Chapter 18,
extended the method to be done on all factors simultaneously. A modified
criterion is

$$\sum_{j=1}^{m}\sum_{i=1}^{p}\left[(\lambda^*_{ij})^2 - \frac{1}{p}\sum_{h=1}^{p}(\lambda^*_{hj})^2\right]^2, \tag{4}$$

which is proportional to the sum of the column variances of the squares of
the transformed factor loadings. The orthogonal matrix $P$ is selected so as to
maximize (4). The procedure tends to maximize the scatter of $(\lambda^*_{ij})^2$ within
columns. Since $(\lambda^*_{ij})^2 \ge 0$, there is a tendency to obtain some large loadings and
some near 0. Kaiser's original criterion was (4) with $(\lambda^*_{ij})^2$ replaced by
$(\lambda^*_{ij})^2/\sum_{h=1}^{m}(\lambda^*_{ih})^2$.
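
For two factors, the criterion can be maximized by a direct search over the rotation angle. The sketch below, in Python with NumPy, uses the unnormalized column-variance form of the criterion and a hypothetical $5 \times 2$ loading matrix arranged like the configuration described for Figure 14.1 (three tests along one direction, two along a perpendicular one); a one-degree grid search stands in for the analytic solution of the pairwise maximization.

```python
import numpy as np

def varimax_criterion(L):
    """Sum over columns of the squared deviations of squared loadings from their column mean."""
    sq = L ** 2
    return float(np.sum((sq - sq.mean(axis=0)) ** 2))

def rotation(theta):
    """2 x 2 orthogonal matrix P for a planar rotation through angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Hypothetical loadings: tests 1-3 and tests 4-5 lie on perpendicular lines.
Lam = np.array([[0.6, 0.6],
                [0.5, 0.5],
                [0.4, 0.4],
                [-0.5, 0.5],
                [-0.4, 0.4]])

angles = np.deg2rad(np.arange(0, 90))          # grid of 0, 1, ..., 89 degrees
values = [varimax_criterion(Lam @ rotation(a)) for a in angles]
best_degrees = int(np.argmax(values))
Lam_star = Lam @ rotation(angles[best_degrees])
```

With this configuration the search selects 45°, after which each test loads on exactly one factor and all loadings are nonnegative.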

Lawley and Maxwell (1971) describe other criteria. One of them is a
measure of similarity to a predetermined $p \times m$ matrix of 1's and 0's.

14.5.3. Orthogonal versus Oblique Factors 


In the case of orthogonal factors the components are uncorrelated in the
population or in the sample according to whether the factors are considered
random or fixed. The idea of uncorrelated factor scores has appeal. Some
psychologists claim that the orthogonality of the factor scores is essential if
one is to consider the factor scores more basic than the test scores. Considerable
debate has gone on among psychologists concerning this point. On the
other side, Thurstone (1947), page vii, says "it seems just as unnecessary to
require that mental traits shall be uncorrelated in the general population as
to require that height and weight be uncorrelated in the general population."

As we have seen, given a pair of matrices $\Lambda$, $\Phi$, equivalent pairs are given
by $\Lambda P$, $P^{-1}\Phi P'^{-1}$ for nonsingular $P$'s. The pair may be selected (i.e., the $P$
given $\Lambda$, $\Phi$) as the one with the most meaningful interpretation in terms of
the subject matter of the tests. The idea of simple structure is that with 0
factor loadings in certain patterns the component factor scores can be given
meaning regardless of the moment matrix. Permitting $\Phi$ to be an arbitrary
positive definite matrix allows more 0's in $\Lambda$.

Another consideration in selecting transformations or identification condi- 
tions is autonomy, or permanence, or invariance with regard to certain 
changes. For example, what happens if a selection of the constituents of a 
population is made? In case of intelligence tests, suppose a selection is made, 
such as college admittees out of high school seniors, that can be assumed to 
involve the primary abilities. One can envisage that the relation between 
unobserved factor scores f and observed test scores x is unaffected by the 
selection, that is, that the matrix of factor loadings $\Lambda$ is unchanged. The
variance of the errors (and specific factors), the diagonal elements of $\Psi$, may
also be considered as unchanged by the selection because the errors are 
uncorrelated with the factors (primary abilities). 

Suppose there is a true model, $\Lambda$, $\Phi$, $\Psi$, and the investigator applies
identification conditions that permit him to discover it. Next, suppose there is
a selection that results in a new population of factor scores so that their
covariance matrix is $\Phi^*$. When the investigator analyzes the new observed
covariance matrix $\Psi + \Lambda\Phi^*\Lambda'$, will he find $\Lambda$ again? If part of the identification
conditions are that the factor moment matrix is $I$, then he will obtain
a different factor loading matrix. On the other hand, if the identification
conditions are entirely on the factor loadings (specified 0's and 1's), the factor
loading matrix from the analysis is the same as before.
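
The point about selection can be checked numerically. The following is a minimal sketch under assumed values: the loadings, error variances, and post-selection factor covariance are hypothetical numbers chosen for illustration, and the Cholesky factor is only one of many matrices $T$ with $TT' = \Phi^*$.

```python
import numpy as np

# Hypothetical loadings with specified 0's and 1's (p = 4 tests, m = 2 factors).
Lam = np.array([[1.0, 0.0],
                [0.8, 0.0],
                [0.0, 1.0],
                [0.0, 0.7]])
Psi = np.diag([0.5, 0.4, 0.6, 0.3])          # hypothetical error variances
Phi_star = np.array([[2.0, 0.6],
                     [0.6, 1.5]])            # hypothetical factor covariance after selection

# Observed covariance matrix in the selected population.
Sigma_star = Psi + Lam @ Phi_star @ Lam.T

# If the factor moment matrix is required to be I, the loadings must absorb
# Phi_star: any T with T T' = Phi_star gives loadings Lam T that fit Sigma_star.
T = np.linalg.cholesky(Phi_star)
Lam_I = Lam @ T
fits = np.allclose(Psi + Lam_I @ Lam_I.T, Sigma_star)

# The loadings found under this identification differ from Lam, and the
# specified zero in row 3, column 1 is no longer zero.
changed = not np.allclose(Lam_I, Lam)
zero_lost = not np.isclose(Lam_I[2, 0], 0.0)
print(fits, changed, zero_lost)
```

Under identification by specified 0's and 1's the original $\Lambda$ still reproduces the new covariance matrix with $\Phi^*$ left free, which is the sense in which the loading matrix is "the same as before."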

The same consideration is relevant in comparing two populations. It may
be reasonable to consider that $\Psi_1 = \Psi_2$, $\Lambda_1 = \Lambda_2$, but $\Phi_1 \ne \Phi_2$. To test the
hypothesis that $\Phi_1 = \Phi_2$, one wants to use identification conditions that
agree with $\Lambda_1 = \Lambda_2$ (rather than $\Lambda_1 = \Lambda_2 C$). The condition should be on
the factor loadings.

What happens if more tests are added (or deleted)? In addition to
observing $X = \Lambda f + \mu + U$, suppose one observes $X^* = \Lambda^* f + \mu^* + U^*$,
where $U^*$ is uncorrelated with $U$. Since the common factors $f$ are unchanged,
$\Phi$ is unchanged. However, the (arbitrary) condition that $\Lambda'\Psi^{-1}\Lambda$
is diagonal is changed; use of this type of condition would lead to a rotation
of $(\Lambda', \Lambda^{*\prime})'$.


14.6. ESTIMATION FOR IDENTIFICATION BY SPECIFIED ZEROS 


We now consider estimation of $\Lambda$, $\Psi$, and $\Phi$ when $\Phi$ is unrestricted and $\Lambda$
is identified by specified 0's and 1's. We assume that each column of $\Lambda$ has at
least $m - 1$ 0's in specified positions and that the submatrix consisting of the
rows of $\Lambda$ containing the 0's specified for a given column is of rank $m - 1$.
(See Section 14.2.2.) We further assume that each column of $\Lambda$ has 1 in a
specified position or, alternatively, that the diagonal element of $\Phi$ corresponding
to that column is 1. Then the model is identified.

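The likelihood function is given by (1) of Section 14.3. The derivatives of
the likelihood function set equal to 0 are

(1)  $\operatorname{diag} \Sigma^{-1}[C - (\Psi + \Lambda\Phi\Lambda')]\Sigma^{-1} = \operatorname{diag} 0$,

(2)  $\Lambda'\Sigma^{-1}[C - (\Psi + \Lambda\Phi\Lambda')]\Sigma^{-1}\Lambda = 0$

for positions in $\Phi$ that are not specified, and

(3)  $\Sigma^{-1}[C - (\Psi + \Lambda\Phi\Lambda')]\Sigma^{-1}\Lambda\Phi = 0$

for positions in $\Lambda$ not specified, where

(4)  $\Sigma = \Psi + \Lambda\Phi\Lambda'$.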


These equations cannot be simplified as in Section 14.3.1 because (3) holds
only for unspecified positions in $\Lambda$, and hence one cannot multiply by $\Sigma$ on
the left. [See Howe (1955), Anderson and Rubin (1956), and Lawley (1958).]

These equations are not useful for computation. The likelihood function,
however, can be maximized numerically.

As noted before, a change in units of measurement, $X^* = DX$, results in a
corresponding change in the parameters $\Lambda$ and $\Psi$ if identification is by 0 in
specified positions of $\Lambda$ and normalization is by $\phi_{jj} = 1$, $j = 1, \ldots, m$. It is
readily verified that the derivative equations (1), (2), (3), and (4) are changed
in a corresponding manner.

Anderson and Amemiya (1988a) have derived the asymptotic distribution 
of the estimators under general conditions. Normality of the observations is 
not required. See also Anderson and Amemiya (1988b). 


14.7. ESTIMATION OF FACTOR SCORES


It is frequently of interest to estimate the factor scores of the individuals in
the group being studied. In the model with nonstochastic factors the factor
scores are incidental parameters that characterize the individuals. As we
have seen (Section 14.4), the maximum likelihood estimators of the parameters
$(\Psi, \Lambda, \mu, f_1, \ldots, f_N)$ do not exist. We shall therefore study the estimation
of the factor scores on the basis that the structural parameters $(\Psi, \Lambda, \mu)$
are known.




When $f_\alpha$ is considered as an incidental parameter, $x_\alpha - \mu$ is an observation
from a distribution with mean $\Lambda f_\alpha$ and covariance matrix $\Psi$. The
weighted least squares estimator of $f_\alpha$ is

(1)  $\hat{f}_\alpha = (\Lambda'\Psi^{-1}\Lambda)^{-1}\Lambda'\Psi^{-1}(x_\alpha - \mu) = \Gamma^{-1}\Lambda'\Psi^{-1}(x_\alpha - \mu)$,

where $\Gamma = \Lambda'\Psi^{-1}\Lambda$ (not necessarily diagonal). This estimator is unbiased
and its covariance matrix is

(2)  $\mathscr{E}(\hat{f}_\alpha - f_\alpha)(\hat{f}_\alpha - f_\alpha)' = (\Lambda'\Psi^{-1}\Lambda)^{-1} = \Gamma^{-1}$

by the usual generalized least squares theory [Bartlett (1937b), (1938)]. It is
the minimum variance unbiased linear estimator of $f_\alpha$. If $x_\alpha$ is normal, the
estimator is also maximum likelihood.
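
As an illustration of the weighted least squares estimator (1), here is a minimal sketch assuming known structural parameters; the function name `bartlett_scores` and all numerical values are hypothetical.

```python
import numpy as np

def bartlett_scores(x, mu, Lam, Psi):
    """Weighted least squares estimate (Lam' Psi^-1 Lam)^-1 Lam' Psi^-1 (x - mu).

    Returns the estimated factor scores and Gamma = Lam' Psi^-1 Lam.
    """
    psi_inv_lam = np.linalg.solve(Psi, Lam)      # Psi^-1 Lam
    gamma = Lam.T @ psi_inv_lam                  # Lam' Psi^-1 Lam
    scores = np.linalg.solve(gamma, psi_inv_lam.T @ (x - mu))
    return scores, gamma

# Hypothetical structural parameters (p = 4, m = 2).
Lam = np.array([[0.9, 0.0],
                [0.7, 0.2],
                [0.1, 0.8],
                [0.0, 0.6]])
Psi = np.diag([0.3, 0.5, 0.4, 0.6])
mu = np.zeros(4)
f_true = np.array([1.0, -0.5])

# With the error term set to zero the estimator returns f exactly,
# which is the algebraic fact behind its unbiasedness.
f_hat, Gamma = bartlett_scores(mu + Lam @ f_true, mu, Lam, Psi)
cov_f_hat = np.linalg.inv(Gamma)                 # covariance matrix in (2)
print(f_hat)
```

Solving linear systems rather than forming explicit inverses is a numerical convenience here, not something the text prescribes.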

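When $f_\alpha$ is considered random [Thomson (1951)], we suppose $X_\alpha$ and $f_\alpha$
have a joint normal distribution with mean vector $(\mu', 0')'$ and covariance
matrix

(3)  $\begin{pmatrix} \Psi + \Lambda\Phi\Lambda' & \Lambda\Phi \\ \Phi\Lambda' & \Phi \end{pmatrix}$.

Then the regression of $f$ on $X$ (Section 2.5) is

(4)  $\mathscr{E}(f \mid X) = \Phi\Lambda'(\Psi + \Lambda\Phi\Lambda')^{-1}(x - \mu) = \Phi(\Phi + \Phi\Gamma\Phi)^{-1}\Phi\Lambda'\Psi^{-1}(x - \mu)$.

The estimator or predictor of $f_\alpha$ is

(5)  $f_\alpha^+ = \Phi(\Phi + \Phi\Gamma\Phi)^{-1}\Phi\Lambda'\Psi^{-1}(x_\alpha - \mu)$.

If $\Phi = I$, the predictor is

(6)  $f_\alpha^+ = (I + \Gamma)^{-1}\Lambda'\Psi^{-1}(x_\alpha - \mu)$.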


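When $\Gamma$ is also diagonal, the $j$th element of (6) is $\gamma_j/(1 + \gamma_j)$ times the $j$th
element of (1). In the conditional distribution of $x_\alpha$ given $f_\alpha$ (for $\Phi = I$)

(7)  $\mathscr{E}(f_\alpha^+ \mid f_\alpha) = (I + \Gamma)^{-1}\Gamma f_\alpha$,

(8)  $\mathscr{C}(f_\alpha^+ \mid f_\alpha) = (I + \Gamma)^{-1}\Gamma(I + \Gamma)^{-1}$,

(9)  $\mathscr{E}[(f_\alpha^+ - f_\alpha)(f_\alpha^+ - f_\alpha)' \mid f_\alpha] = (I + \Gamma)^{-1}[\Gamma + f_\alpha f_\alpha'](I + \Gamma)^{-1}$,

(10)  $\mathscr{E}(f_\alpha^+ - f_\alpha)(f_\alpha^+ - f_\alpha)' = (I + \Gamma)^{-1}$.

This last matrix, describing the mean squared error, is smaller than (2)
describing the unbiased estimator. The estimator (5) or (6) is a Bayes
estimator and is appropriate when $f_\alpha$ is treated as random.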
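
The comparison of the two mean squared error matrices, and the equality of the two forms of the regression coefficient in (4) when $\Phi = I$, can be verified numerically. This is a sketch with hypothetical parameter values, not a general proof.

```python
import numpy as np

# Hypothetical structural parameters (p = 4, m = 2), with Phi = I.
Lam = np.array([[0.9, 0.0],
                [0.7, 0.2],
                [0.1, 0.8],
                [0.0, 0.6]])
Psi = np.diag([0.3, 0.5, 0.4, 0.6])
I_m = np.eye(2)

Gamma = Lam.T @ np.linalg.solve(Psi, Lam)        # Lam' Psi^-1 Lam

# Two forms of the regression coefficient matrix for Phi = I:
# Lam'(Psi + Lam Lam')^-1  and  (I + Gamma)^-1 Lam' Psi^-1.
lhs = Lam.T @ np.linalg.inv(Psi + Lam @ Lam.T)
rhs = np.linalg.solve(I_m + Gamma, np.linalg.solve(Psi, Lam).T)
forms_agree = np.allclose(lhs, rhs)

# Mean squared error matrices: Gamma^-1 for the unbiased estimator,
# (I + Gamma)^-1 for the regression predictor.
mse_unbiased = np.linalg.inv(Gamma)
mse_regression = np.linalg.inv(I_m + Gamma)

# "Smaller" in the sense that the difference is positive definite.
eigenvalues = np.linalg.eigvalsh(mse_unbiased - mse_regression)
print(forms_agree, eigenvalues.min() > 0)
```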


PROBLEMS


14.1. (Sec. 14.2) Identification by 0's. Let


where C is nonsingular, Show that 


0 AX) 


AC=|,. 
G Ay 


implies 


if and only if $\Lambda^{(1)}$ is of rank $m - 1$.
14.2. (Sec. 14.3) For p=5, m= 1. and A =A. prove | 67] =I). ,(A7/u,). 
14.3. (Sec. 14.3) The EM algorithm. 


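(a) If $f$ and $U$ are normal and $f$ and $X$ are observed, show that the likelihood
function based on $(x_1, f_1), \ldots, (x_N, f_N)$ is

$\prod_{\alpha=1}^{N} \left\{ \frac{1}{(2\pi)^{p/2}\prod_{i=1}^{p}\psi_{ii}^{1/2}} \exp\left[-\frac{1}{2}\sum_{i=1}^{p}\frac{\left(x_{i\alpha} - \mu_i - \sum_{j=1}^{m}\lambda_{ij}f_{j\alpha}\right)^2}{\psi_{ii}}\right] \cdot \frac{1}{(2\pi)^{m/2}|\Phi|^{1/2}} \exp\left[-\frac{1}{2}f_\alpha'\Phi^{-1}f_\alpha\right] \right\}$.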




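(b) Show that when the factor scores are included as data the sufficient set of
statistics is $\bar{x}$, $\bar{f}$, $C_{xx} = C$,

$C_{xf} = \frac{1}{N}\sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(f_\alpha - \bar{f})'$,

$C_{ff} = \frac{1}{N}\sum_{\alpha=1}^{N}(f_\alpha - \bar{f})(f_\alpha - \bar{f})'$.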


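(c) Show that the conditional expectations of the covariances in (b) given
$X = (x_1, \ldots, x_N)$, $\Lambda$, $\Phi$, and $\Psi$ are

$C^*_{xx} = \mathscr{E}(C_{xx} \mid X, \Lambda, \Phi, \Psi) = C_{xx}$,

$C^*_{xf} = \mathscr{E}(C_{xf} \mid X, \Lambda, \Phi, \Psi) = C_{xx}(\Psi + \Lambda\Phi\Lambda')^{-1}\Lambda\Phi$,

$C^*_{ff} = \mathscr{E}(C_{ff} \mid X, \Lambda, \Phi, \Psi) = \Phi\Lambda'(\Psi + \Lambda\Phi\Lambda')^{-1}C_{xx}(\Psi + \Lambda\Phi\Lambda')^{-1}\Lambda\Phi + \Phi - \Phi\Lambda'(\Psi + \Lambda\Phi\Lambda')^{-1}\Lambda\Phi$.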


(d) Show that the maximum likelihood estimators of $\Lambda$ and $\Psi$ given $\Phi = I$ are


$\hat{\Psi} = C^*_{xx} - C^*_{xf}C^{*-1}_{ff}C^*_{fx}$.


CHAPTER 15 


Patterns of Dependence; 
Graphical Models 


15.1. INTRODUCTION 


An emphasis in multivariate statistical analysis is that several measurements
on a number of individuals or objects may be correlated, and the methods
developed in this book take account of that dependence. The amount of
association between two variables may be measured by the (Pearson) correlation
of them (a symmetric measure); the association between one variable
and a set may be quantified by a multiple correlation; and the dependence
between one set and another set may be studied by criteria of independence
such as studied in Chapter 9 or by canonical correlations. Similar measures
can be applied in conditional distributions. Another kind of dependence 
(asymmetrical) is characterized by regression coefficients and related mea- 
sures. In this chapter we study models which involve several kinds of 
dependence or more intricate patterns of dependence. 

A graphical model in statistics is a visual diagram in which observable 
variables are identified with points (vertices or nodes) connected by edges and 
an associated family of probability distributions satisfying some independences
specified by the visual pattern. Edges may be undirected (drawn as
line segments) or directed (drawn as arrows). Undirected edges have to do 
with symmetrical dependence and independence, while directed edges may 
reflect a possible direction of action or sequence in time. These indepen- 
dences may come from a priori knowledge of the subject matter or may 
derive from these or other data. Advantages of the graphical display include
ease of comprehension, particularly of complicated patterns, ease of elicitation
of expert opinion, and ease of comparing probabilities.

An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc.

Use of such diagrams goes back at least to the work of the geneticist
Sewall Wright (1921), (1934), who used the term “path analysis.” An elaborate
algebra has been developed for graphical models. Specification of
independences reduces the number of parameters to be determined. Some of
these independences are known as Markov properties. In a time series analysis
of a Markov process (of order 1), for example, the future of the process is
considered independent of the past when the present is given; in such a 
model the correlation between a variable in the past and a variable in the 
future is determined by the correlation between the present variable and the 
variable of the immediate future. This idea is expanded in several ways. 

The family of probability distributions associated with a given diagram 
depends on the properties of the distribution that are represented by the 
graph. These properties for diagrams consisting of undirected edges (known 
as undirected graphs) will be described in Section 15.2; the properties for 
diagrams consisting entirely of directed edges (known as directed graphs) in 
Section 15.3; and properties of diagrams with both types of edges in Section 
15.4. The methods of statistical inference will be given in Section 15.5. 

In this chapter we assume that the variables have a joint nonsingular 
normal distribution; hence, the characterization of a model is in terms of the 
covariance matrix and its inverse, and functions of them. This assumption 
implies that the variables are quantitative and have a positive density. The 
mathematics of graphical models may apply to discrete variables (contingency 
tables) and to nonnormal quantitative variables, but we shall not develop the 
theory necessary to include them. 

There is a considerable social science literature that has followed Wright's
original work. For recent reviews of this writing see, for example, Pearl
(2000) and McDonald (2002).


15.2. UNDIRECTED GRAPHS 


A graph is a set of vertices and edges, G = (V, E). Each vertex is identified 
with a random vector. In this chapter the random variables have a joint 
normal distribution. Each undirected edge is a line connecting two vertices. It 
is designated by its two end points; (u,v) is the same as (v,u) in an 
undirected graph (but not in directed graphs). 

Two vertices connected by an edge are called adjacent; if not connected by
an edge, they are called nonadjacent. In Figure 15.1(a) all vertices are
nonadjacent; in (b) a and b are adjacent; in (c) the pair a and b and the pair
a and c are adjacent; in (d) every pair of vertices are adjacent.

[Figure 15.1: panels (a), (b), (c), (d), each a graph on the vertices a, b, c.]

The family of (normal) distributions associated with G is defined by a set
of requirements on conditional distributions, known as Markov properties.
Since the distributions considered here are normal, the conditions have to do
with the covariance matrix $\Sigma$ and its inverse $\Lambda = \Sigma^{-1}$, which is known as the
concentration matrix. However, many of the lemmas and theorems hold for
nonnormal distributions. We shall consider three definitions of Markov and
then show that they are equivalent.


Definition 15.2.1. The probability distribution on a graph is pairwise
Markov with respect to G if for every pair of vertices $(u, v)$ that are not adjacent
$X_u$ and $X_v$ are independent conditional on all the other variables in the graph.


In symbols

(1)  $X_u \perp\!\!\!\perp X_v \mid X_{V \setminus (u,v)}$,

where $\perp\!\!\!\perp$ means independence and $V \setminus (u, v)$ indicates the set $V$ with $u$ and $v$
deleted. The definition of pairwise Markov is that $\rho_{uv \cdot V \setminus (u,v)} = 0$ for all pairs
for which $(u, v) \notin E$. We may also write $u \perp\!\!\!\perp v \mid V \setminus (u, v)$.

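Let $\Sigma$ and $\Lambda = \Sigma^{-1}$ be partitioned as

(2)  $\Sigma = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix}$,  $\Lambda = \begin{pmatrix} \Lambda_{AA} & \Lambda_{AB} \\ \Lambda_{BA} & \Lambda_{BB} \end{pmatrix}$,

where $A$ and $B$ are disjoint sets of vertices. The conditional distribution of
$X_A$ given $X_B$ is

(3)  $N[\mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B),\ \Sigma_{AA \cdot B}]$.

The conditional covariance matrix is

(4)  $\Sigma_{AA \cdot B} = \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA} = \Lambda_{AA}^{-1}$.

If $A = (1, 2)$ and $B = (3, \ldots, p)$, the covariance of $X_1$ and $X_2$ given $X_3, \ldots, X_p$
is $\sigma_{12 \cdot 3 \ldots p}$ in $\Sigma_{AA \cdot B} = (\sigma_{ij \cdot 3 \ldots p})$. This is 0 if and only if $\lambda_{12} = 0$; that is, $\Sigma_{AA \cdot B}$
is diagonal if and only if $\Lambda_{AA}$ is diagonal.


Theorem 15.2.1. If a distribution on a graph is pairwise Markov, $\lambda_{ij} = 0$ for
$(i, j) \notin E$.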
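
The link between a zero in the concentration matrix and a zero conditional covariance can be seen in a small numerical case. The sketch below assumes a hypothetical 4 × 4 concentration matrix with $\lambda_{12} = 0$; it is strictly diagonally dominant, hence positive definite.

```python
import numpy as np

# Hypothetical concentration matrix for (X1, X2, X3, X4) with lambda_12 = 0.
K = np.array([[2.0, 0.0, 0.5, 0.3],
              [0.0, 1.5, 0.4, 0.2],
              [0.5, 0.4, 2.0, 0.1],
              [0.3, 0.2, 0.1, 1.0]])
Sigma = np.linalg.inv(K)

A = [0, 1]   # X1, X2 (0-based indices)
B = [2, 3]   # X3, X4

S_AA = Sigma[np.ix_(A, A)]
S_AB = Sigma[np.ix_(A, B)]
S_BB = Sigma[np.ix_(B, B)]

# Conditional covariance of X_A given X_B, as in (4).
S_AA_B = S_AA - S_AB @ np.linalg.solve(S_BB, S_AB.T)

# It equals the inverse of the A-block of the concentration matrix ...
matches = np.allclose(S_AA_B, np.linalg.inv(K[np.ix_(A, A)]))

# ... so X1 and X2 are conditionally uncorrelated although marginally correlated.
print(matches, S_AA_B[0, 1], Sigma[0, 1])
```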


Definition 15.2.2. The boundary of a set $A$, termed bd($A$), consists of
those vertices not in $A$ that are adjacent to $A$. The closure of $A$, termed cl($A$), is
$A \cup \mathrm{bd}(A)$.


Definition 15.2.3. A distribution on a graph is locally Markov if for every
vertex $v$ the variable $X_v$ is independent of the variables not in cl($v$) conditional on
the boundary of $v$; in notation,

(5)  $X_v \perp\!\!\!\perp X_{V \setminus \mathrm{cl}(v)} \mid X_{\mathrm{bd}(v)}$.


Theorem 15.2.2. The conditional independences

(6)  $X \perp\!\!\!\perp Y \mid Z$,  $X \perp\!\!\!\perp Z \mid Y$

hold if and only if

(7)  $X \perp\!\!\!\perp (Y, Z)$.


Proof. The relations (6) imply that the density of $X$, $Y$, and $Z$ can be
written as

(8)  $f(x, y, z) = f(x|z)\,g(y|z)\,h(z) = k(x|y)\,l(z|y)\,m(y)$.

Since $g(y|z)h(z) = n(y, z) = l(z|y)m(y)$, (8) implies $f(x|z) = k(x|y)$, which in
turn implies $f(x|z) = k(x|y) = p(x)$. Hence

(9)  $f(x, y, z) = p(x)\,n(y, z)$,

which is the density generating (7). Conversely, (9) can be written as either
form in (8), implying (6). ∎


Corollary 15.2.1. The relations

(10)  $X \perp\!\!\!\perp Y \mid Z, W$,  $X \perp\!\!\!\perp Z \mid Y, W$

hold if and only if

(11)  $X \perp\!\!\!\perp (Y, Z) \mid W$.


The relations in Theorem 15.2.2 and Corollary 15.2.1 are sometimes called
the block independence theorem. They are based on positive densities, that is,
nonsingular normal distributions.


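Theorem 15.2.3. A locally Markov distribution on a graph is pairwise
Markov.


Proof. Suppose the graph is locally Markov (Definition 15.2.3). Let $u$ and
$v$ be nonadjacent vertices. Because $v$ is not adjacent to $u$, it is not in bd($u$);
hence,

(12)  $X_u \perp\!\!\!\perp X_{V \setminus \mathrm{cl}(u)} \mid X_{\mathrm{bd}(u)}$.

The relation (12) can be written

(13)  $X_u \perp\!\!\!\perp \{X_v, X_{V \setminus [\mathrm{cl}(u), v]}\} \mid X_{\mathrm{bd}(u)}$.

Then Corollary 15.2.1 ($X = X_u$, $Y = X_v$, $Z = X_{V \setminus [\mathrm{cl}(u), v]}$, $W = X_{\mathrm{bd}(u)}$) implies

(14)  $X_u \perp\!\!\!\perp X_v \mid X_{V \setminus (u, v)}$. ∎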


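Theorem 15.2.4. A pairwise Markov distribution on a graph is locally
Markov.


Proof. Let $V \setminus \mathrm{cl}(u) = v_1 \cup \cdots \cup v_m$. Then

(15)  $u \perp\!\!\!\perp v_1 \mid \mathrm{bd}(u) \cup v_2 \cup \cdots \cup v_m$,  $u \perp\!\!\!\perp v_2 \mid \mathrm{bd}(u) \cup v_1 \cup v_3 \cup \cdots \cup v_m$,

which by Corollary 15.2.1 implies

(16)  $u \perp\!\!\!\perp v_1 \cup v_2 \mid \mathrm{bd}(u) \cup v_3 \cup \cdots \cup v_m$.

Further, (16) and

(17)  $u \perp\!\!\!\perp v_3 \mid \mathrm{bd}(u) \cup v_1 \cup v_2 \cup v_4 \cup \cdots \cup v_m$

imply

(18)  $u \perp\!\!\!\perp v_1 \cup v_2 \cup v_3 \mid \mathrm{bd}(u) \cup v_4 \cup \cdots \cup v_m$.

This procedure leads to

(19)  $u \perp\!\!\!\perp v_1 \cup \cdots \cup v_m \mid \mathrm{bd}(u)$. ∎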


A third notion of Markov, namely, global, requires some definitions. 




Definition 15.2.4. A path from $B$ to $C$ is a sequence $v_0, v_1, v_2, \ldots, v_n$ of
adjacent vertices with $v_0 \in B$ and $v_n \in C$.


Definition 15.2.5. A set $S$ separates sets $B$ and $C$ if $S$, $B$, and $C$ are
disjoint and every path from $B$ to $C$ intersects $S$.


Thus $S$ separates $B$ and $C$ if for every sequence of vertices $v_0, v_1, \ldots, v_n$
with $v_0 \in B$ and $v_n \in C$ at least one of $v_1, \ldots, v_{n-1}$ is a vertex in $S$. Here $B$
and/or $C$ are nonempty, but $S$ can be empty.
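
Separation in the sense of Definition 15.2.5 can be checked by a graph search that is not allowed to enter $S$. The following is a minimal sketch; the function name `separates` and the edge-list representation are illustrative choices, and the sets are assumed to be disjoint as the definition requires.

```python
from collections import deque

def separates(edges, S, B, C):
    """Return True if every path from B to C in the undirected graph meets S."""
    S, B, C = set(S), set(B), set(C)
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    # Breadth-first search from B that never steps onto a vertex of S.
    seen = set(B)
    queue = deque(B)
    while queue:
        u = queue.popleft()
        if u in C:
            return False          # reached C without passing through S
        for w in adjacency.get(u, ()):
            if w not in S and w not in seen:
                seen.add(w)
                queue.append(w)
    return True

# Figure 15.1(c): edges a-b and a-c.
edges_c = [("a", "b"), ("a", "c")]
a_separates = separates(edges_c, {"a"}, {"b"}, {"c"})
empty_separates = separates(edges_c, set(), {"b"}, {"c"})
print(a_separates, empty_separates)
```

In graph (c) the vertex a separates b and c, while the empty set does not, because the path b, a, c connects them.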


Definition 15.2.6. A distribution on a graph is globally Markov if for every
triplet of disjoint sets $S$, $B$, and $C$ such that $S$ separates $B$ and $C$ the vector
variables $X_B$ and $X_C$ are independent conditional on $X_S$.


In the example of Figure 15.1(c), a separates b and c. If $\rho_{bc \cdot a} = 0$, that is,
$\rho_{bc} - \rho_{ba}\rho_{ac} = 0$, the distribution is globally Markov. Note that a set of
vertices is identified with a vector of variables.

The global Markov property puts restrictions on the possible (normal)
distributions, and that implies fewer parameters about which to make inferences.

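Suppose $V = A \cup B \cup S$, where $A$, $B$, and $S$ are disjoint. Partition $\Sigma$ and
$\Lambda = \Sigma^{-1}$, the concentration matrix, as

(20)  $\Lambda = \begin{pmatrix} \Lambda_{AA} & \Lambda_{AB} & \Lambda_{AS} \\ \Lambda_{BA} & \Lambda_{BB} & \Lambda_{BS} \\ \Lambda_{SA} & \Lambda_{SB} & \Lambda_{SS} \end{pmatrix} = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} & \Sigma_{AS} \\ \Sigma_{BA} & \Sigma_{BB} & \Sigma_{BS} \\ \Sigma_{SA} & \Sigma_{SB} & \Sigma_{SS} \end{pmatrix}^{-1}$.

The conditional distribution of $(X_A', X_B')'$ given $X_S$ is normal with covariance
matrix

(21)  $\begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix} - \begin{pmatrix} \Sigma_{AS} \\ \Sigma_{BS} \end{pmatrix}\Sigma_{SS}^{-1}\begin{pmatrix} \Sigma_{SA} & \Sigma_{SB} \end{pmatrix} = \begin{pmatrix} \Lambda_{AA} & \Lambda_{AB} \\ \Lambda_{BA} & \Lambda_{BB} \end{pmatrix}^{-1}$.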

Theorem 15.2.5. If $S$ separates $A$ and $B$ in a graph with a globally Markov
distribution, $\Lambda_{AB} = 0$.


Proof. Because $S$ separates $A$ and $B$, every element $u$ of $A$ and every
element $v$ of $B$ are nonadjacent, for otherwise the path $(u, v)$ would connect
$A$ and $B$ without intersecting $S$. The globally Markov property is that $X_A$
and $X_B$ are uncorrelated in the conditional distribution, implying that the
covariance matrix of $(X_A', X_B')'$ given $X_S$ is block diagonal and hence that
$\Lambda_{AB} = 0$. ∎


Theorem 15.2.6. A distribution on a globally Markov graph is pairwise
Markov.


Proof. Let the set $B$ be $i$, the set $C$ be $j$ not adjacent to $i$, and the set $A$
the rest of the variables. Any path from $B$ to $C$ must include elements of $A$.
Hence $i$ is independent of $j$ in the distribution conditioned on the other
variables. ∎


Theorem 15.2.7. A globally Markov family of distributions on a graph is 
locally Markov. 


Proof. The boundary of a set $B$ separates $B$ and $V \setminus \mathrm{cl}(B)$. ∎


Theorem 15.2.8. A pairwise Markov family of distributions on a graph is
globally Markov.


Proof. Let $A$, $B$, and $S$ be disjoint sets in a pairwise Markov graph such
that $S$ separates $A$ and $B$. Let #($S$) and #($V$) denote the numbers of
vertices in $S$ and $V$, respectively. If #($V$) = #($S$) + 2, that is, $V = A \cup B \cup S$,
then there must be one vertex in each of $A$ and $B$, and the pairwise Markov
property is exactly the globally Markov property. The rest of the proof is a
backward induction on #($S$). Suppose #($V$) − #($S$) > 2 and $V = A \cup B \cup S$.
Then either $A$ or $B$ or both have more than one vertex. Suppose $A$ has more
than one vertex, and let $u \in A$. Then $S \cup u$ separates $A \setminus u$ and $B$, and
$S \cup (A \setminus u)$ separates $u$ and $B$. By the induction hypothesis

(22)  $X_{A \setminus u} \perp\!\!\!\perp X_B \mid (X_S, X_u)$,  $X_u \perp\!\!\!\perp X_B \mid (X_S, X_{A \setminus u})$.

By Corollary 15.2.1

(23)  $X_A \perp\!\!\!\perp X_B \mid X_S$.

Now suppose $A \cup B \cup S \subset V$. Let $u \in V \setminus (A \cup B \cup S)$. Then $S \cup u$ separates
$A$ and $B$. By the induction hypothesis

(24)  $X_A \perp\!\!\!\perp X_B \mid (X_S, X_u)$.

Also, either $A \cup S$ separates $u$ and $B$ or $B \cup S$ separates $A$ and $u$.
(Otherwise there would be a path from $B$ to $u$ and from $u$ to $A$ that would
not intersect $S$.) If $A \cup S$ separates $u$ and $B$,

(25)  $X_u \perp\!\!\!\perp X_B \mid (X_S, X_A)$.

Then Corollary 15.2.1 applied to (24) and (25) implies

(26)  $(X_A, X_u) \perp\!\!\!\perp X_B \mid X_S$,

from which we derive $X_A \perp\!\!\!\perp X_B \mid X_S$. ∎


Theorems 15.2.3, 15.2.7, and 15.2.8 show that the three Markov properties
are equivalent: any one implies the other two. The proofs here hold fairly
generally, but in this chapter a nonsingular multivariate normal distribution is
assumed; thus all densities are positive.


Definition 15.2.7. A graph G =(V, E) is complete if and only if every two 
vertices in V are adjacent. 


The definition implies that the graph specifies no restriction on the 
covariance matrix of the multivariate normal distribution. 

A subset $A \subseteq V$ induces a subgraph $G_A = (A, E_A)$, where the edge set $E_A$
includes all edges $(u, v)$ of $G$ with $(u, v) \in E$, where $u \in A$ and $v \in A$.
A subset of a graph is complete if and only if every two vertices in $A$ are
adjacent in $E_A$.


Definition 15.2.8. A clique is a maximal complete set of vertices. 


“Maximal” means that if another vertex from $V$ is added to the set, the set
will no longer be complete. A clique can be constructed by starting with one
vertex, say $v_1$. If it is not adjacent to any other vertex, $v_1$ alone constitutes a
clique. If $v_2$ is adjacent to $v_1$ [$(v_1, v_2) \in E$], continue constructing a clique
with $v_1$ and $v_2$ in it until a maximal complete subset is obtained. Thus every
vertex is a member of at least one clique, and every edge is included in at
least one clique.
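
The construction just described can be sketched as a single greedy pass over the vertices. The function name `grow_clique` is hypothetical, and the clique obtained depends on the order in which vertices are examined; the sketch finds one clique containing the starting vertex, not all cliques.

```python
def grow_clique(vertices, edges, start):
    """Grow a maximal complete set of vertices containing `start`."""
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    clique = {start}
    for v in vertices:
        # Add v only if it is adjacent to every vertex already in the set.
        if v not in clique and clique <= adjacency[v]:
            clique.add(v)
    return clique

vertices = ["a", "b", "c"]
graph_a = []                                       # Figure 15.1(a): no edges
graph_c = [("a", "b"), ("a", "c")]                 # Figure 15.1(c)
graph_d = [("a", "b"), ("a", "c"), ("b", "c")]     # Figure 15.1(d)

print(grow_clique(vertices, graph_a, "a"))
print(grow_clique(vertices, graph_c, "a"))
print(grow_clique(vertices, graph_c, "c"))
print(grow_clique(vertices, graph_d, "b"))
```

One pass suffices for maximality: a vertex rejected at some stage fails to be adjacent to a member of the set, and that member stays in the set as it grows.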


Lemma 15.2.1. If the distribution of $X_V$ is Markov, it is determined by the
set of marginal distributions of all cliques.


In Figure 15.1(a) each of a, b, c is a clique; in (b) each of (a, b) and c is a
clique; in (c) each of (a, b) and (a, c) is a clique; in (d) (a, b, c) is a clique.


Definition 15.2.9. The density $f(X_V)$ factorizes with respect to G if there
are nonnegative functions $g_C(X_C)$ depending on the complete subgraphs such that

(27)  $f(X_V) = \prod_{C\ \text{complete}} g_C(X_C)$.


Since it suffices to consider only cliques, an alternative factorization is

(28)  $f(X_V) = \prod_{C^*\ \text{cliques}} g_{C^*}(X_{C^*})$.


These functions $g_C(X_C)$ and $g_{C^*}(X_{C^*})$ are not necessarily densities or
conditional densities. The problems of statistical inference may be reduced to
the problems of the complete subgraphs or cliques.


Definition 15.2.10. A decomposition of a graph is formed by three disjoint
sets $A$, $B$, $S$ if $V = A \cup B \cup S$, $S$ separates $A$ and $B$, and $S$ is complete.


In this definition one or more of the sets $A$, $B$, and $S$ may be empty. If
both $A$ and $B$ are nonempty, the decomposition is termed proper.


Definition 15.2.11. A graph is decomposable if it is complete or if there is
a proper decomposition $(A, B, S)$ into decomposable subgraphs $G_{A \cup S}$ and
$G_{B \cup S}$.


Theorem 15.2.9. Suppose $A$, $B$, $S$ decomposes $G = (V, E)$. Then the density
of $X_V$ factorizes with respect to G if and only if its marginal densities $f_{A \cup S}(x_{A \cup S})$
and $f_{B \cup S}(x_{B \cup S})$ factorize and the densities satisfy

(29)  $f_V(x_V)\,f_S(x_S) = f_{A \cup S}(x_{A \cup S})\,f_{B \cup S}(x_{B \cup S})$.


Proof. Suppose that $f_V(x_V)$ factorizes as

(30)  $f_V(x_V) = \prod_C g_C(x_C)$.

Because $A$, $B$, $S$ decomposes G, every clique is either a subset of $A \cup S$ or a
subset of $B \cup S$. Let $\mathscr{A}$ denote the cliques that are subsets of $A \cup S$, and $\mathscr{B}$
those that are subsets of $B \cup S$. Then $f_V(x_V) = h(x_{A \cup S})\,k(x_{B \cup S})$, where

(31)  $h(x_{A \cup S}) = \prod_{C \in \mathscr{A}} g_C(x_C)$,

(32)  $k(x_{B \cup S}) = \prod_{C \in \mathscr{B}} g_C(x_C)$.

Integration of (30) with respect to $x_B$ gives

(33)  $f_{A \cup S}(x_{A \cup S}) = h(x_{A \cup S})\,\bar{k}(x_S)$,

where

(34)  $\bar{k}(x_S) = \int k(x_{B \cup S})\,dx_B$.

In turn $f_{A \cup S}(x_{A \cup S})$ and $f_{B \cup S}(x_{B \cup S})$ can be factorized, leading to (28).

[Figure 15.2: directed graph with vertices 1 (difficulty), 2 (IQ), 3 (grade), 4 (recommendation), 5 (SAT).]


15.3. DIRECTED GRAPHS 


We now include relations with a direction; the measurement represented by 
one vertex u may precede the measurement represented by another vertex v. 
In the graph this directed edge is displayed as an arrow pointing from $u$ to $v$;
in notation it appears as (u,v), which is now distinguished from (v,u). The 
precedence may indicate the times of measurement, for example, the precipi- 
tation on two successive days, or may indicate possible causation. 

The difficulty of an examination x, may affect the grade of a student x,; 
the grade is also affected by his/her IQ x,. In turn the grade of the student 
influences the quality of a letter of recommendation x,; the IQ is a factor in 
performance on the SAT, x,. See Figure 15.2. (We shall draw figures so that 
the action proceeds from left to right.) 

A graph composed entirely of directed edges is called a directed graph. A
cycle, such as 1 → 2, 2 → 3, 3 → 1, is hard to interpret and hence is usually
ruled out. A directed graph without a cycle is an acyclic directed graph (ADG
or DAG), also known as an acyclic digraph. All directed graphs in this
chapter are acyclic.

An acyclic directed graph may represent a recursive linear system. For
example, Figure 15.2 could represent

(1)  $X_1 = u_1$,
(2)  $X_2 = u_2$,
(3)  $X_3 = \beta_{31}X_1 + \beta_{32}X_2 + u_3$,
(4)  $X_4 = \beta_{43}X_3 + u_4$,
(5)  $X_5 = \beta_{52}X_2 + u_5$,

where $u_1, u_2, u_3, u_4, u_5$ are mutually independent unobserved variables.
Wold (1960) called such models causal chains. Note that the matrix of
coefficients is lower triangular. In general $X_i$ may depend on $X_1, \ldots, X_{i-1}$.
The recursive linear system (1) to (5) generates the recursive factorization

(6)  $f_{12345}(x_1, x_2, x_3, x_4, x_5) = f_1(x_1)\,f_2(x_2)\,f_{3|12}(x_3|x_1, x_2)\,f_{4|3}(x_4|x_3)\,f_{5|2}(x_5|x_2)$.

A directed graph induces a partial order.
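
The covariance structure implied by the recursive system (1) to (5) can be computed directly. The sketch below assumes hypothetical coefficient values and unit variances for the disturbances; `cond_cov` is an illustrative helper, not a library routine.

```python
import numpy as np

# Hypothetical coefficients for the system (1)-(5); disturbances have unit variance.
b31, b32, b43, b52 = 0.5, 0.8, 0.7, 0.6

# Write the system as B x = u with B lower triangular and unit diagonal.
B = np.eye(5)
B[2, 0], B[2, 1] = -b31, -b32
B[3, 2] = -b43
B[4, 1] = -b52

B_inv = np.linalg.inv(B)
Sigma = B_inv @ B_inv.T            # Cov(x) = B^-1 Cov(u) B^-1' with Cov(u) = I

def cond_cov(Sigma, a, b):
    """Covariance matrix of the variables in a given the variables in b."""
    S_aa = Sigma[np.ix_(a, a)]
    S_ab = Sigma[np.ix_(a, b)]
    S_bb = Sigma[np.ix_(b, b)]
    return S_aa - S_ab @ np.linalg.solve(S_bb, S_ab.T)

# X4 is conditionally uncorrelated with (X1, X2, X5) given X3 (0-based indices).
c4 = cond_cov(Sigma, [3, 0, 1, 4], [2])

# X1 and X2 are marginally uncorrelated but become correlated given X3.
c12_given_3 = cond_cov(Sigma, [0, 1], [2])
print(c4[0, 1:], Sigma[0, 1], c12_given_3[0, 1])
```

The last quantity illustrates why conditioning on a later variable (here $X_3$) can induce dependence between earlier ones.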


Definition 15.3.1. A partial ordering of an acyclic directed graph, $u \le v$, is
defined by the existence of a directed path

(7)  $u = v_0 \to v_1 \to \cdots \to v_n = v$.

The partial ordering satisfies the conditions (i) reflexive: $v \le v$; (ii) transitive:
$u \le v$ and $v \le w$ imply $u \le w$; and (iii) antisymmetric: $u \le v$ and $v \le u$ imply
$u = v$. Further, $u \le v$ and $u \ne v$ defines $u < v$.


Definition 15.3.2. If $u \to v$, then $u$ is a parent of $v$, termed $u = \mathrm{pa}(v)$, and
$v$ is a child of $u$, termed $v = \mathrm{ch}(u)$. In symbols

(8)  $\mathrm{pa}(v) = \{w \in V \setminus v \mid w \to v\}$,
(9)  $\mathrm{ch}(u) = \{w \in V \setminus u \mid u \to w\}$.


In the graph displayed in Figure 15.2 we have (1, 2) = pa(3), 3 = pa(4),
2 = pa(5), 3 = ch(1, 2), 4 = ch(3), and 5 = ch(2).


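Definition 15.3.3. If $u < v$, then $v$ is a descendant of $u$,

(10)  $\mathrm{de}(u) = \{v \mid u < v\}$,

and $u$ is an ancestor of $v$,

(11)  $\mathrm{an}(v) = \{u \mid u < v\}$.


The set of nondescendants of $u$ is $\mathrm{Nd}(u) = V \setminus \mathrm{de}(u)$, and the set of strict
nondescendants is $\mathrm{nd}(u) = \mathrm{Nd}(u) \setminus u$. Define $\mathrm{An}(A) = \mathrm{an}(A) \cup A$.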
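
These sets are easy to compute from an edge list. The following sketch uses the graph described for Figure 15.2 (edges 1 → 3, 2 → 3, 3 → 4, 2 → 5); the function names mirror the notation above but are otherwise illustrative.

```python
V = {1, 2, 3, 4, 5}
edges = [(1, 3), (2, 3), (3, 4), (2, 5)]   # (u, w) stands for the arrow u -> w

def pa(v):
    """Parents of v."""
    return {u for u, w in edges if w == v}

def ch(u):
    """Children of u."""
    return {w for x, w in edges if x == u}

def de(u):
    """Descendants of u: vertices reachable from u by a directed path."""
    found, stack = set(), list(ch(u))
    while stack:
        w = stack.pop()
        if w not in found:
            found.add(w)
            stack.extend(ch(w))
    return found

def an(v):
    """Ancestors of v."""
    return {u for u in V if v in de(u)}

def nd(v):
    """Strict nondescendants of v."""
    return V - de(v) - {v}

print(pa(3), ch(2), de(1), an(4), nd(3))
```

Note that ch(2) computed this way contains both 3 and 5, since vertex 2 has arrows to each.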


[Figure 15.3: a vertex v with its parents pa(v), a nondescendant w, and its descendants de(v).]

Note that

(12)  $\mathrm{pa}(v) \subset \mathrm{an}(v) \subset \mathrm{nd}(v)$.


In our study of undirected graphs we considered three Markov properties 
independently defined and then showed that a graph with one Markov 
property also has the other two. In the case of acyclic directed graphs we 
shall define three similar Markov properties, but the definitions are different 
because they take account of the direction of action or influence. 


Definition 15.3.4. A distribution on an acyclic directed graph G is pairwise
Markov if for every $v \in V$ and $w \in \mathrm{nd}(v) \setminus \mathrm{pa}(v)$

(13)  $v \perp\!\!\!\perp w \mid \mathrm{nd}(v) \setminus w$.


In comparison with Definition 15.2.1 for undirected graphs, note that
attention is paid only to vertices in nd($v$); since pa($v$) is the effective
boundary of $v$, the vertices $w$ and $v$ are nonadjacent. (See Figure 15.3.) Note
also that the conditioning set includes the parents of $v$, but not the children
(which are descendants).


Definition 15.3.5. A distribution on an acyclic directed graph is locally
Markov if

(14)  $v \perp\!\!\!\perp [\mathrm{nd}(v) \setminus \mathrm{pa}(v)] \mid \mathrm{pa}(v)$.


In the definition of locally Markov the conditioning is only on the parents
of $v$, but in the definition of pairwise Markov the conditioning is on all of the
other nondescendants. These features correspond to Definitions 15.2.1 and
15.2.3 for undirected graphs.

In Figure 15.2, we have $1 \perp\!\!\!\perp 2, 5$; $3 \perp\!\!\!\perp 5 \mid 1, 2$; $4 \perp\!\!\!\perp 1, 2, 5 \mid 3$; and $5 \perp\!\!\!\perp 1, 3, 4 \mid 2$.
In an undirected graph constructed by replacing arrows in Figure 15.2 by
lines (directed edges by undirected edges), a locally Markov distribution on
the graph would include the conditional independences $1 \perp\!\!\!\perp 2 \mid 3$; $1, 2 \perp\!\!\!\perp 4 \mid 3$;
$1, 3, 4 \perp\!\!\!\perp 5 \mid 2$. In the interpretation of the arrow indicating time sequence $X_3$
relates to the future of $(X_1, X_2)$; the future cannot be conditioned on.


As another example, consider an autoregressive time series $y_0, y_1, \ldots, y_T$
defined by

(15)  $y_t = \rho y_{t-1} + u_t$,  $t = 1, 2, \ldots, T$,

where $u_1, \ldots, u_T$ are independent $N(0, \sigma^2)$ variables and $y_0$ has distribution
$N[0, \sigma^2/(1 - \rho^2)]$. In this case given $y_t$, the future $y_{t+1}, \ldots, y_T$ is independent
of the past $y_0, \ldots, y_{t-1}$.
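
For this autoregressive series the Markov property shows up as a tridiagonal concentration matrix. The sketch below assumes the stationary covariance $\sigma^2\rho^{|s-t|}/(1 - \rho^2)$, which follows from (15) when $y_0$ has variance $\sigma^2/(1 - \rho^2)$; the values of $\rho$, $\sigma^2$, and $T$ are hypothetical.

```python
import numpy as np

rho, sigma2, T = 0.6, 1.0, 5       # hypothetical values
idx = np.arange(T + 1)
lag = np.abs(idx[:, None] - idx[None, :])

# Stationary covariance matrix of (y_0, ..., y_T).
Sigma = sigma2 / (1.0 - rho**2) * rho**lag

# Concentration matrix: entries more than one step off the diagonal vanish,
# so y_s and y_t with |s - t| > 1 are independent given the other variables.
K = np.linalg.inv(Sigma)
far_entries = K[lag > 1]

# The correlation between y_0 and y_2 is the product of the one-step correlations.
corr_02 = Sigma[0, 2] / Sigma[0, 0]
print(np.abs(far_entries).max(), corr_02)
```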


Theorem 15.3.1. A locally Markov distribution on an acyclic directed graph 
is pairwise Markov. 


Proof. The proof is the same as the proof of Theorem 15.2.3 for undirected
graphs. ∎


Theorem 15.3.2. A pairwise Markov distribution on an acyclic directed
graph is locally Markov.


Proof. The proof is the same as the proof of Theorem 15.2.4. ∎


Another Markov property is based on numbering the vertices in an order 
reflecting the direction of the action or the partial ordering induced. 


Definition 15.3.6. An enumeration of the elements of $V$ is called
well-numbered if $i < j \Rightarrow v_j \not\le v_i$, or equivalently $v_j \le v_i \Rightarrow j \le i$.


Theorem 15.3.3. A finite ordered set $(V, \le)$ admits at least one
well-numbering.


Definition 15.3.7. An element $a^* \in V$ is maximal (or terminal) if $a^* \le b$
$\Rightarrow a^* = b$.


Lemma 15.3.1. A finite, partially ordered set $(V, \le)$ has at least one
maximal element $a^*$.


Proof of Lemma. The proof is by induction with $a^* = a$ if #($V$) = 1.
Assume the lemma holds for #($V$) = $n$, and consider #($V$) = $n + 1$. Then
$V = a \cup (V \setminus a)$ for any $a \in V$. Since #($V \setminus a$) = $n$, $V \setminus a$ has a maximal
element, say $\bar{a}$. Then either $\bar{a} \le a$ and so $a$ is maximal, or $\bar{a} \not\le a$ and so $\bar{a}$ is
maximal. ∎


Proof of Theorem 15.3.3. We shall construct a well-numbering. Let $v^*$ be a
maximal element; define $v_n = v^*$. In $V \setminus v_n$ let $v^{**}$ be a maximal element;
define $v_{n-1} = v^{**}$. At the $j$th stage let $v^{***}$ be a maximal element in
$V \setminus (v_n, \ldots, v_{n-j+2})$; define $v_{n-j+1} = v^{***}$, $j = 3, \ldots, n - 1$. Then $v_1 =
V \setminus (v_n, \ldots, v_2)$. This construction satisfies Definition 15.3.6. ∎


The well-numbering of $V$ as $v_1, \ldots, v_n$ implies that in any directed path
$v_{i_1} \to v_{i_2} \to \cdots \to v_{i_k}$ the indices satisfy $i_1 < i_2 < \cdots < i_k$. The
well-numbering is not necessarily unique. Since $V$ is finite, a maximal
element can be found by comparing $v_i$ and $v_j$ for at most $n(n - 1)/2$ pairs.
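
The construction in the proof of Theorem 15.3.3 amounts to repeatedly removing a maximal element. The following minimal sketch assumes the graph is acyclic (otherwise no maximal element may exist and the search fails); the function name `well_numbering` and the tie-breaking by sorted order are illustrative choices.

```python
def well_numbering(vertices, edges):
    """Order the vertices so that every arrow u -> w has u before w.

    Assumes the directed graph given by `edges` is acyclic.
    """
    remaining = set(vertices)
    reversed_order = []
    while remaining:
        # A maximal element has no child among the remaining vertices.
        v_star = next(
            v for v in sorted(remaining)
            if not any(u == v and w in remaining for u, w in edges)
        )
        reversed_order.append(v_star)      # numbered n, n - 1, ... in turn
        remaining.remove(v_star)
    return reversed_order[::-1]

# Graph described for Figure 15.2.
edges = [(1, 3), (2, 3), (3, 4), (2, 5)]
order = well_numbering([1, 2, 3, 4, 5], edges)
position = {v: i for i, v in enumerate(order)}
print(order)
```

Because several vertices may be maximal at a given stage, a different tie-breaking rule can give a different, equally valid well-numbering.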


Definition 15.3.8. Let $\{v_1, \ldots, v_n\}$ be a well-numbering of the acyclic
directed graph G. A distribution on G is well-numbered Markov with respect to
this well-numbering if, for every $i$,

(16)  $v_i \perp\!\!\!\perp (v_1, \ldots, v_{i-1}) \setminus \mathrm{pa}(v_i) \mid \mathrm{pa}(v_i)$.


Apparently the definition depends on the choice of well-numbering, but
this is not the case, by Theorem 15.3.4.


Theorem 15.3.4. A distribution on an acyclic directed graph that is
well-numbered Markov is locally Markov.


Proof. $(v_1, \ldots, v_{i-1}) \setminus \mathrm{pa}(v_i) \subset \mathrm{nd}(v_i) \setminus \mathrm{pa}(v_i)$. ∎


The definition of the global Markov property depends on relating the 
directed graph to a corresponding undirected graph. 


Definition 15.3.9. The moral graph $G^m$ of an acyclic directed graph $G =
(V, E)$ is the undirected graph constructed by adding (undirected) edges between
parents of each vertex $v \in V$ and replacing every directed edge by an undirected
edge.


In the jargon of graph theory, the parents of a vertex are “married.”
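
Definition 15.3.9 translates directly into a short routine. In this sketch undirected edges are represented as two-element frozensets; that representation and the function name `moral_graph` are illustrative assumptions.

```python
from itertools import combinations

def moral_graph(vertices, directed_edges):
    """Return the undirected edge set of the moral graph."""
    # Replace every directed edge by an undirected edge.
    undirected = {frozenset(edge) for edge in directed_edges}
    # "Marry" the parents of each vertex.
    for v in vertices:
        parents = [u for u, w in directed_edges if w == v]
        undirected.update(frozenset(pair) for pair in combinations(parents, 2))
    return undirected

# Graph described for Figure 15.2: vertex 3 has the two parents 1 and 2.
edges = [(1, 3), (2, 3), (3, 4), (2, 5)]
moral = moral_graph([1, 2, 3, 4, 5], edges)
print(sorted(tuple(sorted(e)) for e in moral))
```

For this graph the only edge added by moralization joins vertices 1 and 2.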


Definition 15.3.10. A distribution on an acyclic directed graph is globally
Markov if $A \perp\!\!\!\perp B \mid S$ for every $A$, $B$, and $S$ such that $S$ separates $A$ and $B$ in
$[G_{\mathrm{An}(A \cup B \cup S)}]^m$.


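Theorem 15.3.5. A distribution on an acyclic directed graph that is globally
Markov is locally Markov.


Proof. For any $v \in V$ let $\mathrm{pa}(v) = S$ in the definition of globally Markov.
Let $v = A$ and $\mathrm{nd}(v) \setminus \mathrm{pa}(v) = B$. A vertex $w \in \mathrm{nd}(v) \setminus \mathrm{pa}(v)$ is a vertex in
$\mathrm{An}(A \cup B \cup S)$. Let $\pi = (w = v_0, v_1, \ldots, v_n = v)$ be a path from $w$ to $v$ in
$[G_{\mathrm{An}(A \cup B \cup S)}]^m = [G_{\mathrm{Nd}(v)}]^m$. If $(v_{n-1}, v_n)$ corresponds to a directed edge
$(v_{n-1} \to v_n)$ in $[G_{\mathrm{Nd}(v)}]^m$, then $v_{n-1} \in \mathrm{pa}(v) = S$ and $\mathrm{pa}(v)$ separates
$\mathrm{nd}(v) \setminus \mathrm{pa}(v)$ and $v$. [The directed edge $(v_{n-1} \leftarrow v_n)$ implies $v_{n-1} \in \mathrm{de}(v)$.] ∎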


Theorem 15.3.6. A distribution on an acyclic directed graph that is locally 
Markov is globally Markov. 


The proof is very lengthy and is omitted. 


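Recursive Factorization

The recursive aspect of the acyclic directed graph permits a systematic
factorization of the density. Use the construction of Theorem 15.3.3. Let
$n = |V|$; then $v_n$ is a maximal element of $V$. Then

(17)  $X_{V \setminus [v_n \cup \mathrm{pa}(v_n)]} \perp\!\!\!\perp X_{v_n} \mid \mathrm{pa}(v_n)$.

Thus (in the normal case)

(18)  $\mathscr{E}[X_{v_n} \mid \mathrm{pa}(v_n)] = \alpha_n + B_n x_{\mathrm{pa}(v_n)}$,

(19)  $\mathscr{C}[X_{v_n} \mid \mathrm{pa}(v_n)] = \Sigma_n$.

At the $j$th step let $v_{n-j+1}$ be a maximal element of $V \setminus (v_n, \ldots, v_{n-j+2})$. Then

(20)  $X_{V \setminus [v_n, \ldots, v_{n-j+1}, \mathrm{pa}(v_{n-j+1})]} \perp\!\!\!\perp X_{v_{n-j+1}} \mid \mathrm{pa}(v_{n-j+1})$.

Thus

(21)  $\mathscr{E}[X_{v_{n-j+1}} \mid \mathrm{pa}(v_{n-j+1})] = \alpha_{n-j+1} + B_{n-j+1} x_{\mathrm{pa}(v_{n-j+1})}$,

(22)  $\mathscr{C}[X_{v_{n-j+1}} \mid \mathrm{pa}(v_{n-j+1})] = \Sigma_{n-j+1}$.

The residual of $X_{v_{n-j+1}}$ from its conditional mean is independent of $\mathrm{pa}(v_{n-j+1})$. The relations (18) to (22)
can be written as generating equations. Let

(23)  $X_1 = \alpha_1 + \varepsilon_1$,
(24)  $X_2 = \alpha_2 + B_{21}x_1 + \varepsilon_2$,
(25)  $X_3 = \alpha_3 + B_{31}x_1 + B_{32}x_2 + \varepsilon_3$,
  ⋮
(26)  $X_n = \alpha_n + B_{n1}x_1 + \cdots + B_{n,n-1}x_{n-1} + \varepsilon_n$,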

610 PATTERNS OF DEPENDENCE; GRAPHICAL MODELS 


where €,,...,€, are independent random vectors with ¢'e,e,= 2. In matrix 
form (23) to (26) are 


(27) Bxr=ate, 

where 
a, I 0 0 0 e 
a, -B, I 0 0 e5 

(28) Qc a, : B= -B, - B,, I 0 , €= £3 : 
a, —B,, —B,» -B,; oe 7 €, 


and B,=0 if i<j—k,. Because the determinant of B is 1, (27) can be 
solved for | 


(29) x=T"'a+TDo'e, 


The matrix [~' is also lower triangular. 
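Because the coefficient matrix in (27) is lower triangular with identity blocks on the diagonal, the system can be solved one equation at a time. The sketch below assumes scalar components and a hypothetical layout in which `beta[j][i]` holds the coefficient of $x_i$ in the equation for $x_j$ (i < j); neither the function name nor the layout comes from the text.

```python
def solve_recursive(alpha, beta, eps):
    """Solve the recursive system x_j = alpha_j + sum_{i<j} beta[j][i] * x_i + eps_j.

    This is forward substitution on a unit lower triangular system, written
    for scalar components only.
    """
    x = []
    for j in range(len(alpha)):
        # Only earlier components x_0, ..., x_{j-1} enter the jth equation.
        xj = alpha[j] + eps[j] + sum(beta[j][i] * x[i] for i in range(j))
        x.append(xj)
    return x


alpha = [1.0, 0.0, 2.0]
beta = [[], [0.5], [0.0, -1.0]]   # x_2 depends on x_1; x_3 depends on x_2 only
eps = [0.0, 0.0, 0.0]             # disturbances set to zero for illustration
x = solve_recursive(alpha, beta, eps)
```

With the disturbances set to zero the result is the vector of means implied by the recursion.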


15.4. CHAIN GRAPHS 


A chain graph includes both directed and undirected edges; however, only
certain patterns of vertices and edges are permitted. Suppose the set of
vertices V of the graph G = (V, E) can be partitioned into subsets
$V = V(1) \cup \cdots \cup V(T)$ so that within a subset the vertices are joined by
undirected edges and directed edges join vertices in different subsets. Let $\mathcal{T}(G)$
be the set of vertices 1, ..., T and let $\mathcal{E}(G)$ be the (directed) edge set such
that $\tau \to \sigma$ if and only if there is at least one element $u \in V(\tau)$ and at least
one element $v \in V(\sigma)$ such that $u \to v$ is in E, the edge set of G. Then
$\mathcal{D}(G) = [\mathcal{T}(G), \mathcal{E}(G)]$ is an acyclic directed graph; we can define $\mathrm{pa}_{\mathcal{D}}(\tau)$,
etc., for $\mathcal{D}(G)$.

Let $X_\tau = \{X_u \mid u \in V(\tau)\}$. Within a set the vertices form an undirected
graph relative to the probability distribution conditional on the past (that is,
earlier sets). See Figure 15.4 [Lauritzen (1996)] and Figure 15.5.

We now define the Markov properties as specified by Lauritzen and 
Wermuth (1989) and Frydenberg (1990): 


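(C1) The distribution of $X_\tau$, $\tau = 1, \ldots, T$, is locally Markov with respect to
the acyclic directed graph $\mathcal{D}(G)$; that is,


(1) $X_\tau \perp X_\sigma \mid X_{\mathrm{pa}_{\mathcal{D}}(\tau)}$, $\sigma \in \mathrm{nd}_{\mathcal{D}}(\tau) \setminus \mathrm{pa}_{\mathcal{D}}(\tau)$.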


Figure 15.4. A chain graph.

Figure 15.5. The corresponding induced acyclic directed graph on $V = V(1) \cup V(2) \cup V(3)$.


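(C2) For each $\tau$ the conditional distribution of $X_\tau$ given $X_{\mathrm{pa}_{\mathcal{D}}(\tau)}$ is globally
Markov with respect to the undirected graph on $V(\tau)$.
(C3)


(2) $X_u \perp X_v \mid X_{\mathrm{bd}_G(U)}$, $u \in U \subseteq V(\tau)$, $v \in \mathrm{pa}_{\mathcal{D}}(\tau) \setminus \mathrm{pa}_G(U)$.


Here $\mathrm{bd}_G(U) = \mathrm{pa}_G(U) \cup \mathrm{nb}_G(U)$. A distribution on the chain graph G that
satisfies (C1), (C2), (C3) is LWF block recursive Markov.

In Figure 15.6 $\mathrm{pa}_{\mathcal{D}}(\tau) = \{\tau - 1, \tau - 2\}$ and $\mathrm{nd}_{\mathcal{D}}(\tau) \setminus \mathrm{pa}_{\mathcal{D}}(\tau) = \{\tau - 3, \tau - 4, \ldots, 1\}$.
The set U = {u, w} is a set in $V(\tau)$, and $\mathrm{pa}_G(U)$ is the set in
$V(\tau - 1) \cup V(\tau - 2)$ that includes $\mathrm{pa}_G(u)$ for $u \in U$; that is, $\mathrm{pa}_G(U) = \{x, y\}$.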


Figure 15.6. A chain graph.

Figure 15.7. A chain graph.


Andersson, Madigan, and Perlman (2001) have proposed an alternative
Markov property (AMP), replacing (C3) by


(C3*)
(3) $X_u \perp X_v \mid X_{\mathrm{pa}_G(U)}$, $u \in U \subseteq V(\tau)$, $v \in \mathrm{pa}_{\mathcal{D}}(\tau) \setminus \mathrm{pa}_G(U)$.


In Figure 15.6, $X_v$ for a vertex v in $V(\tau - 2) \cup V(\tau - 1)$ is conditionally
independent of $X_u$ [$u \in U \subseteq V(\tau)$] when regressed on $X_{\mathrm{pa}_G(U)} = (X_x, X_y)$.
The difference between (C3) and (C3*) is that the conditioning in (C3) is on
$\mathrm{bd}_G(U) = \mathrm{pa}_G(U) \cup \mathrm{nb}_G(U)$, but the conditioning in (C3*) is on $\mathrm{pa}_G(U)$
only. See Figure 15.6. The conditioning in (C3*) is on variables in the past.
Figure 15.7 [Andersson, Madigan, and Perlman (2001)] illustrates the
difference between the LWF and AMP Markov properties:


(4) LWF: $X_1 \perp X_4 \mid X_2, X_3$, $\quad X_2 \perp X_3 \mid X_1, X_4$,
(5) AMP: $X_1 \perp X_4 \mid X_2$, $\quad X_2 \perp X_3 \mid X_1$.
Note that in (5) $X_1$ and $X_4$ are conditionally independent given $X_2$; the
conditional distribution of $X_4$ depends on $\mathrm{pa}(v_4)$, but not $X_3$.
The AMP specification allows a block recursive equation formulation. 


In the example in Figure 15.7 the distribution of scalars $X_1$ and $X_2$
[$v_1, v_2 \in V(1)$] can be specified as


(6) $X_1 = \varepsilon_1$,
(7) $X_2 = \varepsilon_2$,


where $(\varepsilon_1, \varepsilon_2)$ has an arbitrary (normal) distribution. Since $X_3$ depends
directly on $X_1$ and $X_4$ depends directly on $X_2$, we write


(8) $X_3 = \beta_{31} X_1 + \varepsilon_3$,
(9) $X_4 = \beta_{42} X_2 + \varepsilon_4$,


where $(\varepsilon_3, \varepsilon_4)$ has an arbitrary distribution independent of $(\varepsilon_1, \varepsilon_2)$, and
hence independent of $(X_1, X_2)$.
In general the AMP model can be expressed as (26) of Section 15.3.


15.5. STATISTICAL INFERENCE 


15.5.1. Normal Distribution 


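Let $x_1, \ldots, x_N$ be N observations on X with distribution $N(\mu, \Sigma)$. Let
$\bar{x} = N^{-1} \sum_{\alpha=1}^{N} x_\alpha$ and $S = (N-1)^{-1} \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' = (N-1)^{-1} [\sum_{\alpha=1}^{N} x_\alpha x_\alpha' - N \bar{x} \bar{x}']$.
The likelihood is


(1) $(2\pi)^{-pN/2} |\Sigma|^{-N/2} \exp[-\tfrac{1}{2} \sum_{\alpha=1}^{N} (x_\alpha - \mu)' \Sigma^{-1} (x_\alpha - \mu)]$

$= (2\pi)^{-pN/2} |\Sigma|^{-N/2} \exp\{-\tfrac{1}{2} [(N-1) \operatorname{tr} S \Sigma^{-1} + N (\bar{x} - \mu)' \Sigma^{-1} (\bar{x} - \mu)]\}$.


The above form shows that $\bar{x}$ and S are a pair of sufficient statistics for $\mu$
and $\Sigma$, and they are independently distributed. The interest in this chapter is
on the dependences, which depend only on the covariance matrix $\Sigma$, not $\mu$.
For the rest of this chapter we shall suppress the mean. Accordingly, we
suppose that the parent distribution is $N(0, \Sigma)$ and the sample is $x_1, \ldots, x_n$,
and $S = (1/n) \sum_{\alpha=1}^{n} x_\alpha x_\alpha'$. The likelihood function can be written


(2) $(2\pi)^{-pn/2} |\Lambda|^{n/2} e^{-\frac{1}{2} \operatorname{tr} n \Lambda S} = \exp[-W(\Lambda) - \tfrac{1}{2} \sum_{i=1}^{p} \lambda_{ii} t_{ii} - \sum_{i<j} \lambda_{ij} t_{ij}]$,


where $\Lambda = (\lambda_{ij}) = \Sigma^{-1}$, $T = (t_{ij}) = \sum_{\alpha=1}^{n} x_\alpha x_\alpha'$, and $W(\Lambda) = \tfrac{1}{2} pn \log(2\pi) - \tfrac{1}{2} n \log |\Lambda|$.

The likelihood is in the exponential family with canonical parameter $\Lambda$
and statistic T. The maximum likelihood estimator of $\Sigma$ with no restriction is
$\hat{\Sigma} = S = (1/n) T$. Since $\Lambda = \Sigma^{-1}$ is a 1-to-1 transformation of $\Sigma$, the
maximum likelihood estimator of $\Lambda$ is $\hat{\Lambda} = \hat{\Sigma}^{-1}$.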




15.5.2. Covariance Selection Models 


In undirected graphs many of the models involve zero restrictions on
elements of $\Lambda$. Dempster (1972) studied such models and introduced the term
covariance selection. When the (directed) graph satisfies the pairwise Markov
condition, $\lambda_{ij} = 0$ for $(i, j) \notin E$. We assume here that the graph satisfies this
Markov condition. Further we assume n > p; then S is positive definite with
probability 1.

The likelihood function is


(3) $(2\pi)^{-pn/2} |\Lambda|^{n/2} \exp[-\tfrac{1}{2} \sum_{i=1}^{p} \lambda_{ii} n s_{ii} - \sum_{i<j,\,(i,j) \in E} \lambda_{ij} n s_{ij}]$,


where $\Lambda$ satisfies the condition $\lambda_{ij} = 0$, $(i, j) \notin E$. In this form the canonical
parameters are $\lambda_{11}, \ldots, \lambda_{pp}$ and $\lambda_{ij}$, $(i, j) \in E$. The canonical variables are
$s_{11}, \ldots, s_{pp}$ and $s_{ij}$, $(i, j) \in E$; these form a sufficient set of statistics. To
maximize the likelihood function we differentiate (3) with respect to $\lambda_{ii}$,
i = 1, ..., p, and $\lambda_{ij}$, $(i, j) \in E$, to obtain the equations (4) and (5).


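Theorem 15.5.1. The maximum likelihood estimator of $\Sigma$ in the model (3)
is given by


(4) $\hat{\sigma}_{ij} = s_{ij}$, i = j or $(i, j) \in E$,
(5) $\hat{\lambda}_{ij} = 0$, $i \ne j$ and $(i, j) \notin E$,
where $\hat{\Lambda} = \hat{\Sigma}^{-1}$.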


This result follows from the general theory of exponential families. See 
Lauritzen (1996), Theorem 5.3 and Appendix D.1. 

Here we shall show that for a decomposable graph the equations (4) and
(5) have a unique positive definite solution by developing an algorithm for its
computation. We follow Speed and Kiiveri (1986).


Theorem 15.5.2. Let L and M be p × p positive definite matrices. There
exists a unique positive definite matrix K such that


(6) $k_{ij} = l_{ij}$, i = j or $(i, j) \in E$,
(7) $k^{ij} = m^{ij}$, $i \ne j$ and $(i, j) \notin E$,
where $(k^{ij}) = K^{-1}$ and $(m^{ij}) = M^{-1}$.


The proof of Theorem 15.5.2 depends on several lemmas. In the maximum
likelihood estimation L = S, M = I or any other diagonal matrix, and $K = \hat{\Sigma}$.




To develop this subject we use the Kullback information. For a pair of
multivariate normal distributions N(0, P) and N(0, R) define


(8) $I(P|R) = \mathcal{E}_P \log \dfrac{n(x|0, P)}{n(x|0, R)} = -\tfrac{1}{2} [\log |P R^{-1}| + \operatorname{tr}(I - P R^{-1})]$.

Lemma 15.5.1. Suppose P and R are positive definite. Then:


(i) $I(P|R) > 0$, $P \ne R$, and $I(P|P) = 0$.
(ii) If $\{P_n\}$ and $\{R_n\}$ are sequences of positive definite matrices such that
$I(P_n|R_n) \to 0$, then $P_n R_n^{-1} \to I$.


Proof. (i) Let the roots of $|P - sR| = 0$ be $s_1 \le \cdots \le s_p$. Then

(9) $\log |P R^{-1}| + \operatorname{tr}(I - P R^{-1}) = \sum_{i=1}^{p} (\log s_i + 1 - s_i) \le 0$,


and (9) is 0 if and only if $s_1 = \cdots = s_p = 1$.
(ii) Let the roots of $|P_n - s R_n| = 0$ be $s_1(n) \le \cdots \le s_p(n)$. Then $I(P_n|R_n) \to 0$
implies $[s_1(n), \ldots, s_p(n)] \to (1, \ldots, 1)$, which implies that $P_n R_n^{-1} \to I$.
∎


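Lemma 15.5.2. Let


(10) $P = \begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix}$, $R = \begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix}$.


Then


(i) The matrix


(11) $Q = \begin{pmatrix} P_{11} & P_{11} R_{11}^{-1} R_{12} \\ R_{21} R_{11}^{-1} P_{11} & R_{22} - R_{21} R_{11}^{-1} R_{12} + R_{21} R_{11}^{-1} P_{11} R_{11}^{-1} R_{12} \end{pmatrix}$


satisfies $Q_{11} = P_{11}$, $Q^{12} = R^{12}$, and $Q^{22} = R^{22}$, where


(12) $Q^{-1} = \begin{pmatrix} P_{11}^{-1} + R^{12} (R^{22})^{-1} R^{21} & R^{12} \\ R^{21} & R^{22} \end{pmatrix} = R^{-1} + \begin{pmatrix} P_{11}^{-1} - R_{11}^{-1} & 0 \\ 0 & 0 \end{pmatrix}$.


(ii) $I(P|R) = I(P|Q) + I(Q|R)$.


Proof. (i) Let


$Q^{-1} = \begin{pmatrix} S & R^{12} \\ R^{21} & R^{22} \end{pmatrix}$.


Then $I = Q^{-1} Q$ can be solved for $S = P_{11}^{-1} + R^{12} (R^{22})^{-1} R^{21}$; $Q = (Q^{-1})^{-1}$
follows from Theorem A.3.3. Then (ii) follows from


(13) $Q R^{-1} = \begin{pmatrix} P_{11} R_{11}^{-1} & 0 \\ R_{21} R_{11}^{-1} (P_{11} R_{11}^{-1} - I) & I \end{pmatrix}$,

(14) $P Q^{-1} = P R^{-1} + \begin{pmatrix} I - P_{11} R_{11}^{-1} & 0 \\ P_{21} (P_{11}^{-1} - R_{11}^{-1}) & 0 \end{pmatrix}$,


and $|P Q^{-1}| \cdot |Q R^{-1}| = |P R^{-1}|$. From (13) and (14)


(15) $\operatorname{tr} P Q^{-1} + \operatorname{tr} Q R^{-1} = \operatorname{tr} P R^{-1} + \operatorname{tr} I_p$. ∎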


Lemma 15.5.2 provides the solution to the problem of finding a matrix Q,
given positive definite matrices P and R, such that


(16) $q_{ij} = p_{ij}$, $i, j \in \{1, \ldots, t\}$,
(17) $q^{ij} = r^{ij}$, $(i, j) \notin \{1, \ldots, t\} \times \{1, \ldots, t\}$.


We now develop an iterative method to find K to satisfy (6) and (7), thus
proving Theorem 15.5.2. Suppose $E = c_1 \cup \cdots \cup c_m$, where $c_1, \ldots, c_m$ are the
m cliques of a decomposable graph G = (V, E). Let $K_0^{-1} = M^{-1}$. Define
recursively $K_n = (k_{ij}(n))$ such that


(18) $k_{ij}(n) = l_{ij}$, $i, j \in c_{n \bmod m}$,
(19) $k^{ij}(n) = k^{ij}(n-1)$, $i, j \notin c_{n \bmod m}$.


By Lemma 15.5.2, $K_n$ is uniquely determined. (The algorithm cycles through
the cliques.) By construction


(20) $I(L|K_{n-1}) = I(L|K_n) + I(K_n|K_{n-1})$.


Summation of (20) from 1 to q gives

(21) $I(L|K_0) = I(L|K_q) + \sum_{j=1}^{q} I(K_j|K_{j-1})$.


Since $I(L|K_q) \ge 0$, $\sum_{j=1}^{q} I(K_j|K_{j-1})$ is bounded and $I(K_n|K_{n-1}) \to 0$ as $n \to \infty$.
The set $\{K^{-1} \mid I(L|K) \le I(L|K_0)\}$ is strictly convex.

Consider the vector sequence $(K_{rm+1}, K_{rm+2}, \ldots, K_{rm+m})$ with index r (n = rm).
It has a convergent subsequence {r(i)}; that is, $(K_{mr(i)+1}, \ldots, K_{mr(i)+m})$
converges to $(K_1^*, \ldots, K_m^*)$, say. Since $I(K_n|K_{n-1}) \to 0$, $K_n K_{n-1}^{-1} \to I$. Then the
matrix $K_{mr(i)+j} K_{mr(i)+j-1}^{-1} \to I$, j = 2, ..., m, which implies $K_1^* = \cdots = K_m^* = K$,
say. Note that $\{(i, j) \mid (i, j) \notin E\}$ satisfies $(i, j) \notin c_h$, h = 1, ..., m. Hence $K_n$
satisfies (7), n = 0, 1, ..., and K does too. Further, $k_{ij}(mr(i) + t)$ satisfies
(18) for $i, j \in c_t$, and K does, too. Figure 15.8 diagrams the sets for $c_1 = \{(i, j)\}$,
i, j = 1, ..., t, and $c_2 = \{(i, j)\}$, i, j = u, u + 1, ..., p, u < t.

Figure 15.8. Diagram of $c_1$ and $c_2$ for $c_1 \cup c_2 = E$.

The procedure allows for construction of a multivariate normal distribu- 
tion with arbitrary marginal distributions over the cliques c,,...,c,,, provided 
that the specified marginal distributions are consistent. 

Theorem 15.5.2 provides a proof of the existence and uniqueness of the 
maximum likelihood estimators. 

The equation (12) is an updating equation. When $Q^{-1} = K_n^{-1}$ and $R^{-1} = K_{n-1}^{-1}$,
the entries in $K_{n-1}^{-1}$ not in $c_{n \bmod m}$ remain unchanged.
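The cyclic procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`inverse`, `submatrix`, `fit_covariance_selection`), the choice M = I as the starting matrix, the fixed number of sweeps, and the representation of cliques as lists of zero-based indices are all assumptions made here. Each step applies the update (12) to the clique block of the inverse and leaves every other entry of the inverse untouched.

```python
def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def submatrix(a, idx):
    """Rows and columns of a indexed by idx."""
    return [[a[i][j] for j in idx] for i in idx]


def fit_covariance_selection(L, cliques, sweeps=10):
    """Cycle through the cliques, matching K to L on each clique in turn."""
    p = len(L)
    # Start from K_0^{-1} = I (the case M = I).
    prec = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for _ in range(sweeps):
        for c in cliques:
            K = inverse(prec)
            target = inverse(submatrix(L, c))    # inverse of the clique block of L
            current = inverse(submatrix(K, c))   # inverse of the clique block of K
            # Update (12): only the clique block of the inverse changes.
            for a, i in enumerate(c):
                for b, j in enumerate(c):
                    prec[i][j] += target[a][b] - current[a][b]
    return inverse(prec), prec


# Path graph 0 - 1 - 2 with cliques {0, 1} and {1, 2}; the pair (0, 2) is not an edge.
L = [[2.0, 0.6, 0.3], [0.6, 1.0, 0.4], [0.3, 0.4, 1.5]]
K, prec = fit_covariance_selection(L, cliques=[[0, 1], [1, 2]])
```

In this example K agrees with L on both cliques, while the (0, 2) element of the inverse stays at its starting value 0 because no update ever touches it.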

Dempster (1972) also proposes some iterative methods for finding K
satisfying $k_{ij} = l_{ij}$, $(i, j) \in D$, and $k^{ij} = m^{ij}$, $(i, j) \notin D$. The entropy of
n(x|0, P) is


(22) $-\mathcal{E}_P \log n(x|0, P) = \tfrac{1}{2} (p \log 2\pi + p + \log |P|)$.


Note that $|\Sigma| = \prod_{i=1}^{p} \sigma_{ii} |R|$, where $R = (\rho_{ij})$. Given that $\hat{\sigma}_{ii} = s_{ii}$, the
selection of $\rho_{ij}$ to maximize the entropy of the fitted normal distribution satisfying
the requirements also maximizes |R| [Dempster (1972)].




15.5.3. Decomposition of Covariance Selection Models 


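An undirected graph is decomposable if the graph is formed by three disjoint
sets A, B, C, where $V = A \cup B \cup C$, A and B are nonempty, C separates A
and B, and C is complete. Then if $X_V$ is globally Markov with respect to G,
we have $X_A \perp X_B \mid X_C$,


(23) $\Sigma^{-1} = \Lambda = \begin{pmatrix} \Lambda_{AA} & 0 & \Lambda_{AC} \\ 0 & \Lambda_{BB} & \Lambda_{BC} \\ \Lambda_{CA} & \Lambda_{CB} & \Lambda_{CC} \end{pmatrix}$,


(24) $\Sigma_{(AB) \cdot C} = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix} - \begin{pmatrix} \Sigma_{AC} \\ \Sigma_{BC} \end{pmatrix} \Sigma_{CC}^{-1} (\Sigma_{CA}, \Sigma_{CB})$,

and

(25) $\Sigma_{AB} - \Sigma_{AC} \Sigma_{CC}^{-1} \Sigma_{CB} = 0$.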


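The maximum likelihood estimator of $\Sigma$ can be constructed from the
maximum likelihood estimators of $\Sigma_{AA \cdot C}$, $\Sigma_{AB \cdot C}$, $\Sigma_{BB \cdot C}$, $B_{(AB) \cdot C}$, and $\Sigma_{CC}$.
If there is no restriction on $\Sigma$, the maximum likelihood estimator of $\Sigma$ is


(26) $\hat{\Sigma}_\Omega = \begin{pmatrix} S_{(AB) \cdot C} + S_{(AB)C} S_{CC}^{-1} S_{C(AB)} & S_{(AB)C} \\ S_{C(AB)} & S_{CC} \end{pmatrix}$,

where

(27) $S_{(AB) \cdot C} = \begin{pmatrix} S_{AA \cdot C} & S_{AB \cdot C} \\ S_{BA \cdot C} & S_{BB \cdot C} \end{pmatrix}$, $S_{C(AB)} = (S_{CA}, S_{CB})$.


If the restriction (25) is imposed, the maximum likelihood estimator is (26)
with $S_{AB \cdot C}$ replaced by 0 to obtain


(28) $\hat{\Sigma}_\omega = \begin{pmatrix} \begin{pmatrix} S_{AA \cdot C} & 0 \\ 0 & S_{BB \cdot C} \end{pmatrix} + \begin{pmatrix} S_{AC} \\ S_{BC} \end{pmatrix} S_{CC}^{-1} (S_{CA}, S_{CB}) & \begin{pmatrix} S_{AC} \\ S_{BC} \end{pmatrix} \\ (S_{CA}, S_{CB}) & S_{CC} \end{pmatrix}$.


The matrix $S_{(AB) \cdot C}$ has the Wishart distribution $W[\Sigma_{(AB) \cdot C}, n - (p_A + p_B)]$,
where $p_A$ and $p_B$ are the number of components of $X_A$ and $X_B$, respectively
(Section 8.3). The matrix $B_{(AB) \cdot C} = S_{(AB)C} S_{CC}^{-1}$ conditional on $(X_{C1}, \ldots, X_{Cn}) = X_C$
has a normal distribution, the covariance of which is given by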


Bac Bac Zu, 0 
29 & vec vec S-c=S-L@ wee lle 
a) el i a See 0 Lap 


and $S_{CC}$ has the Wishart distribution $W(\Sigma_{CC}, n)$. The matrix $S_{(AB) \cdot C}$ and the
matrix $B_{(AB) \cdot C}$ are independent (Chapter 8).

Consider testing the null hypothesis (25). This is testing the null hypothesis
$\Sigma_{AB \cdot C} = 0$ against the alternative $\Sigma_{AB \cdot C} \ne 0$. The determinant of (26) is
$|\hat{\Sigma}_\Omega| = |S_{(AB) \cdot C}| \cdot |S_{CC}|$; the determinant of (28) is $|\hat{\Sigma}_\omega| = |S_{AA \cdot C}| \cdot |S_{BB \cdot C}| \cdot |S_{CC}|$.
The likelihood ratio criterion is


(30) $\left( \dfrac{|\hat{\Sigma}_\Omega|}{|\hat{\Sigma}_\omega|} \right)^{n/2} = \left( \dfrac{|S_{(AB) \cdot C}|}{|S_{AA \cdot C}| \cdot |S_{BB \cdot C}|} \right)^{n/2}$.


Since the sample covariance matrix $S_{(AB) \cdot C}$ has the Wishart distribution
$W[\Sigma_{(AB) \cdot C}, n - (p_A + p_B)]$, where $p_A$ and $p_B$ are the numbers of components
of $X_A$ and $X_B$ (Section 8.2), the criterion is, in effect, $U_{p_A, p_B, n - (p_A + p_B)}$
studied in Sections 8.4 and 8.5.

As another example, consider the graph in Figure 15.9. Note that node 4
separates (1, 2, 3) and (5, 6); nodes 1, 4 separate 2 and 3; and node 4 separates
5 and 6. These separations imply three conditional independences:
$(X_1, X_2, X_3) \perp (X_5, X_6) \mid X_4$, $X_2 \perp X_3 \mid (X_1, X_4)$, and $X_5 \perp X_6 \mid X_4$. In terms of
covariances these conditional independences are


(31) $\Sigma_{(123)(56) \cdot 4} = \Sigma_{(123)(56)} - \Sigma_{(123)4} \Sigma_{44}^{-1} \Sigma_{4(56)} = 0$,
(32) $\Sigma_{23 \cdot (14)} = \Sigma_{23} - \Sigma_{2(14)} \Sigma_{(14)(14)}^{-1} \Sigma_{(14)3} = 0$,
(33) $\Sigma_{56 \cdot 4} = \Sigma_{56} - \Sigma_{54} \Sigma_{44}^{-1} \Sigma_{46} = 0$.

Figure 15.9




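In view of (31) the restriction (32) can be written as
(34) $\Sigma_{23 \cdot 14} = \Sigma_{23 \cdot 4} - \Sigma_{21 \cdot 4} \Sigma_{11 \cdot 4}^{-1} \Sigma_{13 \cdot 4} = 0$.


It will be convenient to reorder the subvectors as $X_2, X_3, X_1, X_5, X_6, X_4$ to
write


(35) $S = \begin{pmatrix} S_{(2 \ldots 6)(2 \ldots 6)} & S_{(2 \ldots 6)4} \\ S_{4(2 \ldots 6)} & S_{44} \end{pmatrix} = \begin{pmatrix} S_{(2 \ldots 6)(2 \ldots 6) \cdot 4} + S_{(2 \ldots 6)4} S_{44}^{-1} S_{4(2 \ldots 6)} & S_{(2 \ldots 6)4} \\ S_{4(2 \ldots 6)} & S_{44} \end{pmatrix}$,


where the subscript (2...6) refers to the reordered subvectors $X_2, X_3, X_1, X_5, X_6$.
The determinant of S is
(36) $|S| = |S_{(2 \ldots 6)(2 \ldots 6) \cdot 4}| \cdot |S_{44}|$.


If the condition $(X_1, X_2, X_3) \perp (X_5, X_6) \mid X_4$ is imposed, the maximum
likelihood estimator is (35) with $S_{(123)(56) \cdot 4}$ replaced by 0 to obtain


(37) $\begin{pmatrix} \begin{pmatrix} S_{(231)(231) \cdot 4} & 0 \\ 0 & S_{(56)(56) \cdot 4} \end{pmatrix} + S_{(2 \ldots 6)4} S_{44}^{-1} S_{4(2 \ldots 6)} & S_{(2 \ldots 6)4} \\ S_{4(2 \ldots 6)} & S_{44} \end{pmatrix}$.


The determinant of (37) is


(38) $|S_{(231)(231) \cdot 4}| \cdot |S_{(56)(56) \cdot 4}| \cdot |S_{44}|$.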


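The likelihood ratio criterion for $\Sigma_{(123)(56) \cdot 4} = 0$ is


(39) $\left( \dfrac{|S_{(2 \ldots 6)(2 \ldots 6) \cdot 4}|}{|S_{(231)(231) \cdot 4}| \cdot |S_{(56)(56) \cdot 4}|} \right)^{n/2} = U_{(231)(56) \cdot 4}^{n/2}$.


Here $U_{(231)(56) \cdot 4}$ has the distribution of $U_{p_1 + p_2 + p_3,\, p_5 + p_6,\, n - p}$ (Section 8.4) since
the distribution of $S_{(2 \ldots 6)(2 \ldots 6) \cdot 4}$ is $W[\Sigma_{(2 \ldots 6)(2 \ldots 6) \cdot 4}, n - p_4]$ independent
of $S_{44}$.


The first three rows and columns of $S_{(2 \ldots 6)(2 \ldots 6) \cdot 4}$ constitute the matrix


(40) $S_{(231)(231) \cdot 4} = \begin{pmatrix} S_{22 \cdot 4} & S_{23 \cdot 4} & S_{21 \cdot 4} \\ S_{32 \cdot 4} & S_{33 \cdot 4} & S_{31 \cdot 4} \\ S_{12 \cdot 4} & S_{13 \cdot 4} & S_{11 \cdot 4} \end{pmatrix} = \begin{pmatrix} \begin{pmatrix} S_{22 \cdot 14} & S_{23 \cdot 14} \\ S_{32 \cdot 14} & S_{33 \cdot 14} \end{pmatrix} + \begin{pmatrix} S_{21 \cdot 4} \\ S_{31 \cdot 4} \end{pmatrix} S_{11 \cdot 4}^{-1} (S_{12 \cdot 4}, S_{13 \cdot 4}) & \begin{pmatrix} S_{21 \cdot 4} \\ S_{31 \cdot 4} \end{pmatrix} \\ (S_{12 \cdot 4}, S_{13 \cdot 4}) & S_{11 \cdot 4} \end{pmatrix}$.

The determinant of (40) is
(41) $|S_{(231)(231) \cdot 4}| = |S_{(23)(23) \cdot 14}| \cdot |S_{11 \cdot 4}|$.


The estimator of $\Sigma_{(231)(231) \cdot 4}$ with $X_2 \perp X_3 \mid X_1, X_4$ imposed is (40) with $S_{23 \cdot 14}$
replaced by 0 to obtain


(42) $\begin{pmatrix} \begin{pmatrix} S_{22 \cdot 14} & 0 \\ 0 & S_{33 \cdot 14} \end{pmatrix} + \begin{pmatrix} S_{21 \cdot 4} \\ S_{31 \cdot 4} \end{pmatrix} S_{11 \cdot 4}^{-1} (S_{12 \cdot 4}, S_{13 \cdot 4}) & \begin{pmatrix} S_{21 \cdot 4} \\ S_{31 \cdot 4} \end{pmatrix} \\ (S_{12 \cdot 4}, S_{13 \cdot 4}) & S_{11 \cdot 4} \end{pmatrix}$,


the determinant of which is


(43) $|S_{22 \cdot 14}| \cdot |S_{33 \cdot 14}| \cdot |S_{11 \cdot 4}|$.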


The likelihood ratio criterion for $\Sigma_{23 \cdot 14} = 0$ is


(44) $\left( \dfrac{|S_{(23)(23) \cdot 14}|}{|S_{22 \cdot 14}| \cdot |S_{33 \cdot 14}|} \right)^{n/2} = U_{23 \cdot 14}^{n/2}$.


The statistic $U_{23 \cdot 14}$ has the distribution of $U_{p_2, p_3, n - (p_1 + p_2 + p_3 + p_4)}$ (Section 8.4)
since $S_{(23)(23) \cdot 14}$ has the distribution $W[\Sigma_{(23)(23) \cdot 14}, n - (p_1 + p_4)]$ independent
of $S_{11 \cdot 4}$.

The estimator of $\Sigma_{(56)(56) \cdot 4}$ with $\Sigma_{56 \cdot 4} = 0$ imposed is $S_{(56)(56) \cdot 4}$ with $S_{56 \cdot 4}$
replaced by 0 to obtain


(45) $\begin{pmatrix} S_{55 \cdot 4} & 0 \\ 0 & S_{66 \cdot 4} \end{pmatrix}$.

The likelihood ratio criterion for testing $\Sigma_{56 \cdot 4} = 0$ is

(46) $\left( \dfrac{|S_{(56)(56) \cdot 4}|}{|S_{55 \cdot 4}| \cdot |S_{66 \cdot 4}|} \right)^{n/2} = U_{56 \cdot 4}^{n/2}$.


The statistic $U_{56 \cdot 4}$ has the distribution of $U_{p_5, p_6, n - (p_4 + p_5 + p_6)}$ since $S_{(56)(56) \cdot 4}$
has the distribution $W(\Sigma_{(56)(56) \cdot 4}, n - p_4)$ independent of $S_{44}$.

The estimator of $\Sigma$ under three null hypotheses is (37) with $S_{(231)(231) \cdot 4}$
replaced by (42) and $S_{(56)(56) \cdot 4}$ replaced by (45). The determinant of this
matrix is


(47) $|\hat{\Sigma}_\omega| = |S_{22 \cdot 14}| \cdot |S_{33 \cdot 14}| \cdot |S_{11 \cdot 4}| \cdot |S_{55 \cdot 4}| \cdot |S_{66 \cdot 4}| \cdot |S_{44}|$.


The likelihood ratio criterion for testing the three null hypotheses is


(48) $\left( \dfrac{|\hat{\Sigma}_\Omega|}{|\hat{\Sigma}_\omega|} \right)^{n/2} = (U_{(231)(56) \cdot 4}\, U_{23 \cdot 14}\, U_{56 \cdot 4})^{n/2}$.

When the null hypotheses are true, the factors $U_{(231)(56) \cdot 4}$, $U_{23 \cdot 14}$, and $U_{56 \cdot 4}$ are
independent. Their distributions are discussed in Sections 8.4 and 8.5. In
particular the moments of these factors are given and asymptotic expansions
of distributions are described.


15.5.4. Directed Graphs 


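We suppose that the vertices are well-numbered, 1, ..., n; the N observations
$x_{(1)}, \ldots, x_{(N)}$ are made on $X = (X_1', \ldots, X_n')'$. The model is (22) to (25) or
(26) of Section 15.3. Let $\bar{x} = N^{-1} \sum_{\alpha=1}^{N} x_{(\alpha)}$ and $S = (N-1)^{-1} \sum_{\alpha=1}^{N} (x_{(\alpha)} - \bar{x})(x_{(\alpha)} - \bar{x})'$.
The model (26) consists of $x_1 = \alpha_1 + \varepsilon_1$ and n − 1 regressions
(23) to (25). The vector $\alpha_1$ in $x_1 = \alpha_1 + \varepsilon_1$ is estimated by $\bar{x}_1$. If $\mathrm{pa}(v_2)$ is
vacuous, $\alpha_2$ is estimated by $\bar{x}_2$; if $\mathrm{pa}(v_2)$ is not vacuous and $X_{\mathrm{pa}(v_2)} = X_1$, then
$B_2$ and $\alpha_2$ are estimated by


(49) $\hat{B}_2 = \sum_{\alpha=1}^{N} (x_{2\alpha} - \bar{x}_2)(x_{1\alpha} - \bar{x}_1)' \left[ \sum_{\alpha=1}^{N} (x_{1\alpha} - \bar{x}_1)(x_{1\alpha} - \bar{x}_1)' \right]^{-1}$,

(50) $\hat{\alpha}_2 = \bar{x}_2 - \hat{B}_2 \bar{x}_1$.

In general

(51) $\hat{B}_j = \sum_{\alpha=1}^{N} (x_{j\alpha} - \bar{x}_j)(x_{\mathrm{pa}(v_j),\alpha} - \bar{x}_{\mathrm{pa}(v_j)})' \left[ \sum_{\alpha=1}^{N} (x_{\mathrm{pa}(v_j),\alpha} - \bar{x}_{\mathrm{pa}(v_j)})(x_{\mathrm{pa}(v_j),\alpha} - \bar{x}_{\mathrm{pa}(v_j)})' \right]^{-1}$,

(52) $\hat{\alpha}_j = \bar{x}_j - \hat{B}_j \bar{x}_{\mathrm{pa}(v_j)}$.


Conditional on the observations $x_{\mathrm{pa}(v_j),\alpha}$, the distribution of these estimators is normal.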
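For scalar components the estimators reduce to the familiar least squares slope and intercept of the regression of a vertex on its single parent. The sketch below assumes scalar data held in plain lists; the function name `regress_on_parent` and the sample values are hypothetical and chosen only for illustration.

```python
def regress_on_parent(child, parent):
    """Least squares slope and intercept of child on parent (scalar case)."""
    n = len(parent)
    mean_parent = sum(parent) / n
    mean_child = sum(child) / n
    # Sum of cross products of deviations from the means.
    cross = sum((c - mean_child) * (q - mean_parent) for c, q in zip(child, parent))
    # Sum of squared deviations of the parent.
    squares = sum((q - mean_parent) ** 2 for q in parent)
    slope = cross / squares
    # The intercept is the child mean minus the slope times the parent mean.
    intercept = mean_child - slope * mean_parent
    return intercept, slope


x1 = [0.0, 1.0, 2.0, 3.0]   # observations on the parent
x2 = [1.0, 3.0, 5.0, 7.0]   # observations on the child, exactly 1 + 2 * x1
alpha_hat, beta_hat = regress_on_parent(x2, x1)
```

Because the illustrative data lie exactly on a line, the estimates recover its intercept and slope.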


15.5.5. Chain Graphs 


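The condition (C1) of Section 15.4 specifies that $X_u \perp X_v \mid X_{\mathrm{pa}_{\mathcal{D}}(\tau)}$ for $u \in V(\tau)$
and for $v \in V(\sigma)$, where $\sigma \in \mathrm{nd}_{\mathcal{D}}(\tau) \setminus \mathrm{pa}_{\mathcal{D}}(\tau)$; that is, the past earlier than
$\mathrm{pa}_{\mathcal{D}}(\tau)$ is independent of the present. This condition corresponds to the
Markov property in time series analysis. Thus $X_\tau$ is treated in terms of deviations
from the regression of $X_\tau$ on $X_{\mathrm{pa}_{\mathcal{D}}(\tau)}$:


$\mathcal{E}(X_\tau \mid X_{\mathrm{pa}_{\mathcal{D}}(\tau)}) = \alpha_\tau + B_\tau X_{\mathrm{pa}_{\mathcal{D}}(\tau)}$.


The vector $\alpha_\tau$ and the matrix $B_\tau$ are estimated as for directed graphs.

The Markov property (C2) indicates the analysis in terms of deviations
$X_\tau - \alpha_\tau - B_\tau X_{\mathrm{pa}_{\mathcal{D}}(\tau)}$. The estimation of the structure of dependence within
$V(\tau)$ is carried out as in Section 15.5.2.

The Markov property (C3*) specifies $X_u \perp X_v \mid X_{\mathrm{pa}_G(U)}$ for $u \in U \subseteq V(\tau)$
and $v \in \mathrm{pa}_{\mathcal{D}}(\tau) \setminus \mathrm{pa}_G(U)$. The property is a restriction on the regression of
$X_u$ on $X_{\mathrm{pa}_{\mathcal{D}}(\tau)}$.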


ACKNOWLEDGMENTS 


The preparation of this chapter drew heavily on lecture notes of Michael 
Perlman. Thanks are also due Michael Perlman and Ingram Olkin for 
reading drafts of the manuscript. 


APPENDIX A 


Matrix Theory 


A.1. DEFINITION OF A MATRIX AND OPERATIONS 
ON MATRICES 


In this appendix we summarize some of the well-known definitions and 
theorems of matrix algebra. A number of results that are not always con- 
tained in books on matrix algebra are proved here. 

An m × n matrix A is a rectangular array of real numbers


(1) $A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$,


which may be abbreviated $(a_{ij})$, i = 1, 2, ..., m, j = 1, 2, ..., n. Capital
boldface letters will be used to denote matrices whose elements are the
corresponding lowercase letters with appropriate subscripts. The sum of two
matrices A and B of the same numbers of rows and columns, respectively, is
defined by

(2) $A + B = (a_{ij}) + (b_{ij}) = (a_{ij} + b_{ij})$.

The product of a matrix by a real number $\lambda$ is defined by

(3) $\lambda A = A \lambda = (\lambda a_{ij})$.


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson 
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc.




These operations have the algebraic properties 


(4) $A + B = B + A$,

(5) $(A + B) + C = A + (B + C)$,
(6) $A + (-1)A = (0)$,

(7) $(\lambda + \mu) A = \lambda A + \mu A$,
(8) $\lambda (A + B) = \lambda A + \lambda B$,
(9) $\lambda (\mu A) = (\lambda \mu) A$.


The matrix (0) with all elements 0 is denoted as 0. The operation A + (−1)B
is denoted as A − B.

If A has the same number of columns as B has rows, that is, $A = (a_{ij})$,
i = 1, ..., l, j = 1, ..., m, $B = (b_{jk})$, j = 1, ..., m, k = 1, ..., n, then A and B
can be multiplied according to the rule


(10) $AB = (a_{ij})(b_{jk}) = \left( \sum_{j=1}^{m} a_{ij} b_{jk} \right)$, i = 1, ..., l, k = 1, ..., n;


that is, AB is a matrix with l rows and n columns, the element in the ith row
and kth column being $\sum_{j=1}^{m} a_{ij} b_{jk}$. The matrix product has the properties


(11) $(AB)C = A(BC)$,
(12) $A(B + C) = AB + AC$,
(13) $(A + B)C = AC + BC$.


The relationships (11)-(13) hold provided one side is meaningful (i.e., the
numbers of rows and columns are such that the operations can be performed);
it follows then that the other side is also meaningful. Because of (11) we can
write


(14) $(AB)C = A(BC) = ABC$.


The product BA may be meaningless even if AB is meaningful, and even
when both are meaningful they are not necessarily equal.
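A small numerical check of the last remark may help. The helper `matmul` below is written for this illustration only; it applies rule (10) to matrices stored as nested lists, and the two matrices are arbitrary choices.

```python
def matmul(a, b):
    """Product of matrices stored as nested lists, following rule (10)."""
    inner = len(b)            # number of columns of a = number of rows of b
    return [[sum(a[i][j] * b[j][k] for j in range(inner))
             for k in range(len(b[0]))]
            for i in range(len(a))]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]

AB = matmul(A, B)   # multiplying on the right by B interchanges the columns of A
BA = matmul(B, A)   # multiplying on the left by B interchanges the rows of A
```

Both products are defined here, yet they differ.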

The transpose of the l × m matrix $A = (a_{ij})$ is defined to be the m × l
matrix A' which has in the jth row and ith column the element that A has in
the ith row and jth column. The operation of transposition has the
properties


(15) $(A')' = A$,
(16) $(A + B)' = A' + B'$,
(17) $(AB)' = B'A'$,


again with the restriction (which is understood throughout this book) that at 
least one side is meaningful. 

A vector x with m components can be treated as a matrix with m rows
and one column. Therefore, the above operations hold for vectors. 

We shall now be concerned with square matrices of the same size, which 
can be added and multiplied at will. The number of rows and columns will be 
taken to be p. A is called symmetric if A=A’. A particular matrix of 
considerable interest is the identity matrix 


(18) $I = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} = (\delta_{ij})$,


where $\delta_{ij}$, the Kronecker delta, is defined by

(19) $\delta_{ij} = 1$, i = j,
$\quad\;\; = 0$, $i \ne j$.

The identity matrix satisfies

(20) $IA = AI = A$.


We shall write the identity as $I_p$ when we wish to emphasize that it is of
order p. Associated with any square matrix A is the determinant |A|, defined
by


(21) $|A| = \sum (-1)^{f(j_1, \ldots, j_p)} \prod_{i=1}^{p} a_{i j_i}$,


where the summation is taken over all permutations $(j_1, \ldots, j_p)$ of the set of
integers (1, ..., p), and $f(j_1, \ldots, j_p)$ is the number of transpositions required
to change (1, ..., p) into $(j_1, \ldots, j_p)$. A transposition consists of interchanging
two numbers, and it can be shown that, although one can transform (1, ..., p)
into $(j_1, \ldots, j_p)$ by transpositions in many different ways, the number of
transpositions required is always even or always odd, so that $(-1)^{f(j_1, \ldots, j_p)}$ is
consistently defined. Then


(22) $|AB| = |A| \cdot |B|$.
Also
(23) $|A| = |A'|$.
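The permutation definition of the determinant can be evaluated directly for small matrices. In this sketch the sign of each permutation is obtained by counting inversions, whose parity agrees with the parity of the number of transpositions; the function names are illustrative, and the approach is practical only for small p because it visits all p! permutations.

```python
from itertools import permutations


def sign(perm):
    """Return +1 or -1 according to the parity of the permutation."""
    inversions = sum(1 for i in range(len(perm))
                     for j in range(i + 1, len(perm)) if perm[i] > perm[j])
    return -1 if inversions % 2 else 1


def det(a):
    """Determinant as a signed sum over all permutations of the column indices."""
    p = len(a)
    total = 0
    for perm in permutations(range(p)):
        term = sign(perm)
        for i in range(p):
            term *= a[i][perm[i]]
        total += term
    return total


def matmul(a, b):
    """Matrix product for nested lists."""
    return [[sum(a[i][j] * b[j][k] for j in range(len(b)))
             for k in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


A = [[2, 1], [1, 3]]
B = [[0, 1], [4, 2]]
```

With integer entries the product rule (22) and the transpose rule (23) can be checked exactly.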


A submatrix of A is a rectangular array obtained from A by deleting rows
and columns. A minor is the determinant of a square submatrix of A. The
minor of an element $a_{ij}$ is the determinant of the submatrix of a square
matrix A obtained by deleting the ith row and jth column. The cofactor of
$a_{ij}$, say $A_{ij}$, is $(-1)^{i+j}$ times the minor of $a_{ij}$. It follows from (21) that


(24) $|A| = \sum_{j=1}^{p} a_{ij} A_{ij} = \sum_{i=1}^{p} a_{ij} A_{ij}$.


If $|A| \ne 0$, there exists a unique matrix B such that AB = I. Then B is
called the inverse of A and is denoted by $A^{-1}$. Let $a^{hk}$ be the element of $A^{-1}$
in the hth row and kth column. Then


(25) $a^{hk} = \dfrac{A_{kh}}{|A|}$.


The operation of taking the inverse satisfies


(26) $(AC)^{-1} = C^{-1} A^{-1}$,
since
(27) $(AC)(C^{-1} A^{-1}) = A (C C^{-1}) A^{-1} = A I A^{-1} = A A^{-1} = I$.


Also $I^{-1} = I$ and $A^{-1} A = I$. Furthermore, since the transposition of (27) gives
$(A^{-1})' A' = I$, we have $(A^{-1})' = (A')^{-1}$.

A matrix whose determinant is not zero is called nonsingular. If $|A| \ne 0$,
then the only solution to


(28) $Az = 0$


is the trivial one z = 0 [by multiplication of (28) on the left by $A^{-1}$]. If
|A| = 0, there is at least one nontrivial solution (that is, $z \ne 0$). Thus an
equivalent definition of A being nonsingular is that (28) have only the trivial
solution.

A set of vectors $z_1, \ldots, z_r$ is said to be linearly independent if there exists
no set of scalars $c_1, \ldots, c_r$, not all zero, such that $\sum_{i=1}^{r} c_i z_i = 0$. A q × p
matrix D is said to be of rank r if the maximum number of linearly
independent columns is r. Then every minor of order r + 1 must be zero
(from the remarks in the preceding paragraph applied to the relevant square
matrix of order r + 1), and at least one minor of order r must be nonzero.
Conversely, if there is at least one minor of order r that is nonzero, there is
at least one set of r columns (or rows) which is linearly independent. If all
minors of order r + 1 are zero, there cannot be any set of r + 1 columns (or
rows) that are linearly independent, for such linear independence would
imply a nonzero minor of order r + 1, but this contradicts the assumption.
Thus rank r is equivalently defined by the maximum number of linearly
independent rows, by the maximum number of linearly independent columns,
or by the maximum order of nonzero minors.
We now consider the quadratic form


(29) $x'Ax = \sum_{i,j=1}^{p} a_{ij} x_i x_j$,


where $x' = (x_1, \ldots, x_p)$ and $A = (a_{ij})$ is a symmetric matrix. This matrix A
and the quadratic form are called positive semidefinite if $x'Ax \ge 0$ for all x. If
$x'Ax > 0$ for all $x \ne 0$, then A and the quadratic form are called positive
definite. In this book positive definite implies the matrix is symmetric.


Theorem A.1.1. If C with p rows and columns is positive definite, and if B
with p rows and q columns, $q \le p$, is of rank q, then B'CB is positive definite.


Proof. Given a vector $y \ne 0$, let x = By. Since B is of rank q, $By = x \ne 0$.
Then


(30) $y'(B'CB)y = (By)'C(By) = x'Cx > 0$.


The proof is completed by observing that B'CB is symmetric. As a converse,
we observe that B'CB is positive definite only if B is of rank q, for otherwise
there exists $y \ne 0$ such that By = 0. ∎


Corollary A.1.1. If C is positive definite and B is nonsingular, then B'CB is 
positive definite. 


Corollary A.1.2. If C is positive definite, then $C^{-1}$ is positive definite.


Proof. C must be nonsingular; for if Cx = 0 for $x \ne 0$, then x'Cx = 0 for
this x, but that is contrary to the assumption that C is positive definite. Let
B in Theorem A.1.1 be $C^{-1}$. Then $B'CB = (C^{-1})' C C^{-1} = (C^{-1})'$. Transposing
$C C^{-1} = I$, we have $(C^{-1})' C' = (C^{-1})' C = I$. Thus $C^{-1} = (C^{-1})'$. ∎


Corollary A.1.3. The q × q matrix formed by deleting p − q rows of a
positive definite matrix C and the corresponding p − q columns of C is positive
definite.


Proof. This follows from Theorem A.1.1 by forming B by taking the p × p
identity matrix and deleting the columns corresponding to those deleted
from C. ∎


The trace of a square matrix A is defined as $\operatorname{tr} A = \sum_{i=1}^{p} a_{ii}$. The following
properties are verified directly:


(31) $\operatorname{tr}(A + B) = \operatorname{tr} A + \operatorname{tr} B$,


(32) $\operatorname{tr} AB = \operatorname{tr} BA$.


A square matrix A is said to be diagonal if $a_{ij} = 0$, $i \ne j$. Then
$|A| = \prod_{i=1}^{p} a_{ii}$, for in (24) $|A| = a_{11} A_{11}$, and in turn $A_{11}$ is evaluated similarly.

A square matrix A is said to be triangular if $a_{ij} = 0$ for i > j or
alternatively for i < j. If $a_{ij} = 0$ for i > j, the matrix is upper triangular, and, if
$a_{ij} = 0$ for i < j, it is lower triangular. The product of two upper triangular
matrices A, B is upper triangular, for the i, jth term (i > j) of AB is
$\sum_{k=1}^{p} a_{ik} b_{kj} = 0$ since $a_{ik} = 0$ for k < i and $b_{kj} = 0$ for k > j. Similarly, the
product of two lower triangular matrices is lower triangular. The determinant
of a triangular matrix is the product of the diagonal elements. The inverse of
a nonsingular triangular matrix is triangular in the same way.


Theorem A.1.2. If A is nonsingular, there exists a nonsingular lower
triangular matrix F such that FA = A* is nonsingular upper triangular.


Proof. Let $A = A_1$. Define recursively $A_g = (a_{ij}^{(g)}) = F_{g-1} A_{g-1}$, g = 2, ..., p,
where $F_{g-1} = (f_{ij}^{(g-1)})$ has elements


(33) $f_{jj}^{(g-1)} = 1$, j = 1, ..., p,

(34) $f_{i,g-1}^{(g-1)} = -\dfrac{a_{i,g-1}^{(g-1)}}{a_{g-1,g-1}^{(g-1)}}$, i = g, ..., p,

(35) $f_{ij}^{(g-1)} = 0$, otherwise.


Then

(36) $a_{ij}^{(g)} = 0$, i = j + 1, ..., p, j = 1, ..., g − 1,

(37) $a_{ij}^{(g)} = a_{ij}^{(g-1)}$, i = 1, ..., g − 1, j = 1, ..., p,

(38) $a_{ij}^{(g)} = a_{ij}^{(g-1)} + f_{i,g-1}^{(g-1)} a_{g-1,j}^{(g-1)} = a_{ij}^{(g-1)} - \dfrac{a_{i,g-1}^{(g-1)} a_{g-1,j}^{(g-1)}}{a_{g-1,g-1}^{(g-1)}}$, i, j = g, ..., p.


Note that $F = F_{p-1} \cdots F_1$ is lower triangular and the elements of $A_g$ in the
first g − 1 columns below the diagonal are 0; in particular A* = FA is upper
triangular. From $|A| \ne 0$ and $|F_{g-1}| = 1$, we have $|A_{g-1}| \ne 0$. Hence
$a_{11}^{(1)}, \ldots, a_{g-2,g-2}^{(g-2)}$ are different from 0 and the last p − g columns of $A_{g-1}$ can
be numbered so $a_{g-1,g-1}^{(g-1)} \ne 0$; then $f_{i,g-1}^{(g-1)}$ is well defined. ∎


The equation FA = A* can be solved to obtain A = LR, where R = A* is
upper triangular and $L = F^{-1}$ is lower triangular and has 1's on the main
diagonal (because F is lower triangular and has 1's on the main diagonal).
This is known as the LR decomposition.
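The elimination in the proof translates into a short routine. The sketch below is an illustration under a simplifying assumption: it does not renumber columns, so it simply stops if a zero pivot is met, whereas the proof allows renumbering to avoid that. The function name `lr_decompose` is hypothetical.

```python
def lr_decompose(a):
    """Factor a square matrix as A = L R by elimination without pivoting.

    L is lower triangular with 1's on the diagonal; R is upper triangular.
    """
    p = len(a)
    R = [list(map(float, row)) for row in a]
    L = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for g in range(p - 1):
        if R[g][g] == 0.0:
            raise ZeroDivisionError("zero pivot; columns would need renumbering")
        for i in range(g + 1, p):
            m = R[i][g] / R[g][g]      # multiplier; its negative is the element of F
            L[i][g] = m                # L = F^{-1} collects the multipliers
            for j in range(g, p):
                R[i][j] -= m * R[g][j]
    return L, R


A = [[4, 2], [2, 3]]
L, R = lr_decompose(A)
```

For this matrix a single elimination step suffices, and multiplying L by R reproduces A.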


Corollary A.1.4. If A is positive definite, there exists a lower triangular
nonsingular matrix F such that FAF' is diagonal and positive definite.


Proof. From Theorem A.1.2, there exists a lower triangular nonsingular
matrix F such that FA is upper triangular and nonsingular. Then FAF' is
upper triangular and symmetric; hence it is diagonal. ∎


Corollary A.1.5. The determinant of a positive definite matrix A is positive. 


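Proof. From the construction of FAF',


(39) $FAF' = \begin{pmatrix} a_{11}^{(1)} & 0 & \cdots & 0 \\ 0 & a_{22}^{(2)} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & a_{pp}^{(p)} \end{pmatrix}$


is positive definite, and hence $a_{gg}^{(g)} > 0$, g = 1, ..., p, and
$0 < |FAF'| = |F| \cdot |A| \cdot |F'| = |A|$. ∎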




Corollary A.1.6. If A is positive definite, there exists a lower triangular 
matrix G such that GAG' = J. 


Proof. Let $FAF' = D^2$, and let D be the diagonal matrix whose diagonal
elements are the positive square roots of the diagonal elements of $D^2$. Then
$G = D^{-1} F$ serves the purpose. ∎


Corollary A.1.7 (Cholesky Decomposition). If A is positive definite, there
exists a unique lower triangular matrix T ($t_{ij} = 0$, i < j) with positive diagonal
elements such that A = TT'.


Proof. From Corollary A.1.6, $A = G^{-1} (G')^{-1}$, where G is lower triangular.
Then $T = G^{-1}$ is lower triangular. ∎


In effect this theorem was proved in Section 7.2 for A = VV’. 
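The factor T can be computed row by row by equating the elements of A and TT'. The following sketch uses the standard row-wise recursion, which is one common way to carry out the computation and is not taken from the text; the function name `cholesky` is chosen here for illustration.

```python
import math


def cholesky(a):
    """Lower triangular T with positive diagonal such that A = T T'.

    Assumes a is symmetric positive definite; otherwise math.sqrt may fail.
    """
    p = len(a)
    T = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            # Contribution of the columns of T that are already computed.
            s = sum(T[i][k] * T[j][k] for k in range(j))
            if i == j:
                T[i][i] = math.sqrt(a[i][i] - s)
            else:
                T[i][j] = (a[i][j] - s) / T[j][j]
    return T


A = [[4.0, 2.0], [2.0, 5.0]]
T = cholesky(A)
```

For this matrix the factor has integer entries, so the identity A = TT' can be verified by hand.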


A.2. CHARACTERISTIC ROOTS AND VECTORS 


The characteristic roots of a square matrix B are defined as the roots of the 
characteristic equation 


(1) |B - Al] =0. 


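Alternative terms are latent roots and eigenvalues. For example, with


$B = \begin{pmatrix} 5 & 2 \\ 2 & 5 \end{pmatrix}$,

we have

(2) $|B - \lambda I| = \begin{vmatrix} 5 - \lambda & 2 \\ 2 & 5 - \lambda \end{vmatrix} = 25 - 4 - 10\lambda + \lambda^2 = \lambda^2 - 10\lambda + 21$.

The degree of the polynomial equation (1) is the order of the matrix B and
the constant term is |B|.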
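For a 2 × 2 matrix the characteristic equation is a quadratic whose coefficients are the trace and the determinant, so the roots follow from the quadratic formula. The sketch below assumes a symmetric matrix (so the roots are real) and uses, as an assumed example, a matrix with trace 10 and determinant 21, matching the polynomial $\lambda^2 - 10\lambda + 21$; the function name is hypothetical.

```python
import math


def characteristic_roots_2x2(b):
    """Roots of |B - lambda I| = lambda^2 - (tr B) lambda + |B| for symmetric 2 x 2 B."""
    trace = b[0][0] + b[1][1]
    determinant = b[0][0] * b[1][1] - b[0][1] * b[1][0]
    disc = math.sqrt(trace * trace - 4 * determinant)   # real for symmetric B
    return (trace - disc) / 2, (trace + disc) / 2


B = [[5, 2], [2, 5]]        # trace 10, determinant 21
small, large = characteristic_roots_2x2(B)
```

The two roots sum to the trace and their product equals the determinant, which is the constant term of the polynomial.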

A matrix C is said to be orthogonal if C'C = I; it follows that CC' = I. Let
the vectors $x' = (x_1, \ldots, x_p)$ and $y' = (y_1, \ldots, y_p)$ represent two points in a
p-dimensional Euclidean space. The distance squared between them is
$D(x, y) = (x - y)'(x - y)$. The transformation z = Cx can be thought of as a
change of coordinate axes in the p-dimensional space. If C is orthogonal, the
transformation is distance-preserving, for

(3) $D(Cx, Cy) = (Cy - Cx)'(Cy - Cx) = (y - x)' C'C (y - x) = (y - x)'(y - x) = D(x, y)$.


Since the angles of a triangle are determined by the lengths of its sides, the
transformation z = Cx also preserves angles. It consists of a rotation together
with a possible reflection of one or more axes. We shall denote $\sqrt{x'x}$ by $\|x\|$.


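Theorem A.2.1. Given any symmetric matrix B, there exists an orthogonal
matrix C such that


(4) $C'BC = D = \begin{pmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & d_p \end{pmatrix}$.


If B is positive semidefinite, then $d_i \ge 0$, i = 1, ..., p; if B is positive definite, then
$d_i > 0$, i = 1, ..., p.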


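The proof is given in the discussion of principal components in Section
11.2 for the case of B positive semidefinite and holds for B symmetric. The
characteristic equation (1) under transformation by C becomes


(5) $0 = |C'| \cdot |B - \lambda I| \cdot |C| = |C'(B - \lambda I)C| = |C'BC - \lambda I| = |D - \lambda I|$

$= \begin{vmatrix} d_1 - \lambda & 0 & \cdots & 0 \\ 0 & d_2 - \lambda & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & d_p - \lambda \end{vmatrix} = \prod_{i=1}^{p} (d_i - \lambda)$.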


Thus the characteristic roots of B are the diagonal elements of the trans- 
formed matrix D. 

If $\lambda_i$ is a characteristic root of B, then a vector $x_i$ not identically 0
satisfying
(6) $(B - \lambda_i I) x_i = 0$


is called a characteristic vector (or eigenvector) of the matrix B corresponding
to the characteristic root $\lambda_i$. Any scalar multiple of $x_i$ is also a characteristic
vector. When B is symmetric, $x_i'(B - \lambda_i I) = 0$. If the roots are distinct,
$x_j' B x_i = 0$ and $x_j' x_i = 0$, $i \ne j$. Let $c_i = (1/\|x_i\|) x_i$ be the ith normalized
characteristic vector, and let $C = (c_1, \ldots, c_p)$. Then C'C = I and BC = CD.
These lead to (4). If a characteristic root has multiplicity m, then a set of m
corresponding characteristic vectors can be replaced by m linearly
independent linear combinations of them. The vectors can be chosen to satisfy (6)
and $x_j' x_i = 0$ and $x_j' B x_i = 0$, $i \ne j$.

A characteristic vector lies in the direction of the principal axis (see
Chapter 11). The characteristic roots of $B$ are proportional to the squares of
the reciprocals of the lengths of the principal axes of the ellipsoid

(7)  $x'Bx = 1,$

since this becomes under the rotation $y = C'x$

(8)  $1 = y'Dy = \sum_{i=1}^{p} d_i y_i^2.$


For a pair of matrices $A$ (nonsingular) and $B$ we shall also consider
equations of the form

(9)  $|B - \lambda A| = 0.$

The roots of such equations are of interest because of their invariance under
certain transformations. In fact, for nonsingular $C$, the roots of

(10)  $|C'BC - \lambda(C'AC)| = 0$

are the same as those of (9) since

(11)  $|C'BC - \lambda C'AC| = |C'(B - \lambda A)C| = |C'| \cdot |B - \lambda A| \cdot |C|$

and $|C'| = |C| \ne 0$.

By Corollary A.1.6 we have that if $A$ is positive definite there is a matrix
$E$ such that $E'AE = I$. Let $E'BE = B^*$. From Theorem A.2.1 we deduce that
there exists an orthogonal matrix $C$ such that $C'B^*C = D$, where $D$ is
diagonal. Defining $EC$ as $F$, we have the following theorem:

Theorem A.2.2. Given $B$ positive semidefinite and $A$ positive definite, there
exists a nonsingular matrix $F$ such that

(12)  $F'BF = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{pmatrix},$

(13)  $F'AF = I,$

where $\lambda_1 \ge \cdots \ge \lambda_p$ $(\ge 0)$ are the roots of (9). If $B$ is positive definite, $\lambda_i > 0$,
$i = 1, \ldots, p$.




Corresponding to each root $\lambda_i$ there is a vector $x_i$ satisfying

(14)  $(B - \lambda_i A)x_i = 0$

and $x_i'Ax_i = 1$. If the roots are distinct, $x_j'Bx_i = 0$ and $x_j'Ax_i = 0$, $i \ne j$. Then
$F = (x_1, \ldots, x_p)$. If a root has multiplicity $m$, then a set of $m$ linearly
independent $x_i$'s can be replaced by $m$ linearly independent combinations of
them. The vectors can be chosen to satisfy (14) and $x_j'Bx_i = 0$ and $x_j'Ax_i = 0$,
$i \ne j$.


Theorem A.2.3 (The Singular Value Decomposition). Given an $n \times p$
matrix $X$, $n \ge p$, there exists an $n \times n$ orthogonal matrix $P$, a $p \times p$ orthogonal
matrix $Q$, and an $n \times p$ matrix $D$ consisting of a $p \times p$ diagonal positive
semidefinite matrix and an $(n - p) \times p$ zero matrix such that

(15)  $X = PDQ.$


Proof. From Theorem A.2.1, there exists a $p \times p$ orthogonal matrix $Q$ and
a diagonal matrix $E$ such that

(16)  $QX'XQ' = E = \begin{pmatrix} E_1 & 0 \\ 0 & 0 \end{pmatrix},$

where $E_1$ is diagonal and positive definite. Let $XQ' = Y = (Y_1\ Y_2)$, where the
number of columns of $Y_1$ is the order of $E_1$. Then $Y_2'Y_2 = 0$, and hence
$Y_2 = 0$. Let $P_1 = Y_1E_1^{-1/2}$. Then $P_1'P_1 = I$. An $n \times n$ orthogonal matrix
$P = (P_1\ P_2)$ satisfying the theorem is obtained by adjoining $P_2$ to make
$P$ orthogonal. Then the upper left-hand corner of $D$ is $E_1^{1/2}$, and the rest of $D$
consists of zeros. ∎


Theorem A.2.4. Let $A$ be positive definite and $B$ be positive semidefinite.
Then

(17)  $\lambda_p \le \dfrac{x'Bx}{x'x} \le \lambda_1,$

where $\lambda_1$ and $\lambda_p$ are the largest and smallest roots of (1), and

(18)  $\lambda_p \le \dfrac{x'Bx}{x'Ax} \le \lambda_1,$

where $\lambda_1$ and $\lambda_p$ are the largest and smallest roots of (9).

Proof. The inequalities (17) were essentially proved in Section 11.2, and
can also be derived from (4). The inequalities (18) follow from Theorem
A.2.2. ∎




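A square matrix $A$ is idempotent if $A^2 = A$. If $\lambda$ satisfies $|A - \lambda I| = 0$,
there exists a vector $x \ne 0$ such that $Ax = \lambda x = A^2x$. However, $A^2x = A(Ax)
= \lambda Ax = \lambda^2 x$. Thus $\lambda^2 = \lambda$, and $\lambda$ is either 0 or 1. The multiplicity of $\lambda = 1$ is
the rank of $A$. If $A$ is $p \times p$, then $I_p - A$ is idempotent of rank $p - (\text{rank } A)$,
and $A$ and $I_p - A$ are orthogonal. If $A$ is symmetric, there is an orthogonal
matrix $O$ such that

(19)  $OAO' = \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix}, \qquad O(I - A)O' = \begin{pmatrix} 0 & 0 \\ 0 & I \end{pmatrix}.$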


A.3. PARTITIONED VECTORS AND MATRICES 


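Consider the matrix $A$ defined by (1) of Section A.1. Let

(1)  $A_{11} = (a_{ij})$,  $i = 1, \ldots, p$,  $j = 1, \ldots, q$;
     $A_{12} = (a_{ij})$,  $i = 1, \ldots, p$,  $j = q+1, \ldots, n$;
     $A_{21} = (a_{ij})$,  $i = p+1, \ldots, m$,  $j = 1, \ldots, q$;
     $A_{22} = (a_{ij})$,  $i = p+1, \ldots, m$,  $j = q+1, \ldots, n$.

Then we can write

(2)  $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}.$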


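We say that $A$ has been partitioned into submatrices $A_{ij}$. Let $B$ ($m \times n$) be
partitioned similarly into submatrices $B_{ij}$, $i, j = 1, 2$. Then

(3)  $A + B = \begin{pmatrix} A_{11} + B_{11} & A_{12} + B_{12} \\ A_{21} + B_{21} & A_{22} + B_{22} \end{pmatrix}.$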


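Now partition $C$ ($n \times r$) as

(4)  $C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix},$

where $C_{11}$ and $C_{12}$ have $q$ rows and $C_{11}$ and $C_{21}$ have $s$ columns. Then

(5)  $AC = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11}C_{11} + A_{12}C_{21} & A_{11}C_{12} + A_{12}C_{22} \\ A_{21}C_{11} + A_{22}C_{21} & A_{21}C_{12} + A_{22}C_{22} \end{pmatrix}.$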




To verify this, consider an element in the first $p$ rows and first $s$ columns of
$AC$. The $i, j$th element is

(6)  $\sum_{k=1}^{n} a_{ik}c_{kj}, \qquad i \le p, \quad j \le s.$

This sum can be written

(7)  $\sum_{k=1}^{q} a_{ik}c_{kj} + \sum_{k=q+1}^{n} a_{ik}c_{kj}.$

The first sum is the $i, j$th element of $A_{11}C_{11}$, the second sum is the $i, j$th
element of $A_{12}C_{21}$, and therefore the entire sum (6) is the $i, j$th element of
$A_{11}C_{11} + A_{12}C_{21}$. In a similar fashion we can verify that the other submatrices
of $AC$ can be written as in (5).
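
The blockwise product can be checked numerically. The following Python sketch partitions two small integer matrices and compares the product assembled from submatrices with the ordinary product; the particular matrices, the split points, and the helper names `partition` and `assemble` are assumptions chosen only for illustration.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matadd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def partition(M, row_split, col_split):
    """Split M into the four submatrices (M11, M12, M21, M22)."""
    M11 = [row[:col_split] for row in M[:row_split]]
    M12 = [row[col_split:] for row in M[:row_split]]
    M21 = [row[:col_split] for row in M[row_split:]]
    M22 = [row[col_split:] for row in M[row_split:]]
    return M11, M12, M21, M22

def assemble(M11, M12, M21, M22):
    """Inverse of partition: glue four submatrices back together."""
    top = [left + right for left, right in zip(M11, M12)]
    bottom = [left + right for left, right in zip(M21, M22)]
    return top + bottom

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]   # m = 3, n = 4
C = [[1, 0, 2], [0, 1, 1], [3, 1, 0], [2, 2, 1]]    # n = 4, r = 3
p, q, s = 2, 2, 1                                   # illustrative split points

A11, A12, A21, A22 = partition(A, p, q)
C11, C12, C21, C22 = partition(C, q, s)

blockwise = assemble(
    matadd(matmul(A11, C11), matmul(A12, C21)),
    matadd(matmul(A11, C12), matmul(A12, C22)),
    matadd(matmul(A21, C11), matmul(A22, C21)),
    matadd(matmul(A21, C12), matmul(A22, C22)),
)
direct = matmul(A, C)
print(blockwise == direct)
```

The only requirement is that the column split of the left factor equals the row split of the right factor (both equal to $q$ here), so that every submatrix product is defined.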

We note in passing that if $A$ is partitioned as in (2), then the transpose of
$A$ can be written

(8)  $A' = \begin{pmatrix} A_{11}' & A_{21}' \\ A_{12}' & A_{22}' \end{pmatrix}.$


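If $A_{12} = 0$ and $A_{21} = 0$, then for $A$ positive definite and $A_{11}$ square,

(9)  $A^{-1} = \begin{pmatrix} A_{11}^{-1} & 0 \\ 0 & A_{22}^{-1} \end{pmatrix}.$

The matrix on the right exists because $A_{11}$ and $A_{22}$ are nonsingular. That the
right-hand matrix is the inverse of $A$ is verified by multiplication:

(10)  $\begin{pmatrix} A_{11} & 0 \\ 0 & A_{22} \end{pmatrix} \begin{pmatrix} A_{11}^{-1} & 0 \\ 0 & A_{22}^{-1} \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix},$

which is a partitioned form of $I$.
We also note that

(11)  $\begin{vmatrix} A_{11} & 0 \\ 0 & A_{22} \end{vmatrix} = \begin{vmatrix} A_{11} & 0 \\ 0 & I \end{vmatrix} \cdot \begin{vmatrix} I & 0 \\ 0 & A_{22} \end{vmatrix} = |A_{11}| \cdot |A_{22}|.$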


The evaluation of the first determinant in the middle is made by expanding
according to minors of the last row; the only nonzero element in the sum is
the last, which is 1 times a determinant of the same form with $I$ of order one
less. The procedure is repeated until $|A_{11}|$ is the minor. Similarly,

(12)  $\begin{vmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{vmatrix} = \begin{vmatrix} I & 0 \\ 0 & A_{22} \end{vmatrix} \cdot \begin{vmatrix} A_{11} & A_{12} \\ 0 & I \end{vmatrix} = |A_{11}| \cdot |A_{22}|.$


A useful fact is that if $A_1$, of $q$ rows and $p$ columns, is of rank $q$, there
exists a matrix $A_2$ of $p - q$ rows and $p$ columns such that

(13)  $A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix}$

is nonsingular. This statement is verified by numbering the columns of $A$ so
that $A_{11}$, consisting of the first $q$ columns of $A_1$, is nonsingular (at least one
$q \times q$ minor of $A_1$ is different from zero) and then taking $A_2$ as $(0\ I)$; then

(14)  $|A| = \begin{vmatrix} A_{11} & A_{12} \\ 0 & I \end{vmatrix} = |A_{11}|,$

which is not equal to zero.


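Theorem A.3.1. Let the square matrix $A$ be partitioned as in (2) so that $A_{22}$
is square. If $A_{22}$ is nonsingular, let

(15)  $B = \begin{pmatrix} I & -A_{12}A_{22}^{-1} \\ 0 & I \end{pmatrix}, \qquad C = \begin{pmatrix} I & 0 \\ -A_{22}^{-1}A_{21} & I \end{pmatrix}.$

Then

(16)  $BA = \begin{pmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} & 0 \\ A_{21} & A_{22} \end{pmatrix}, \qquad AC = \begin{pmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} & A_{12} \\ 0 & A_{22} \end{pmatrix},$

(17)  $BAC = \begin{pmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} & 0 \\ 0 & A_{22} \end{pmatrix}.$

If $A$ is symmetric, $C = B'$.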


Theorem A.3.2. Let the square matrix $A$ be partitioned as in (2) so that $A_{22}$
is square. If $A_{22}$ is nonsingular,

(18)  $|A| = |A_{11} - A_{12}A_{22}^{-1}A_{21}| \cdot |A_{22}|.$

Proof. Equation (18) follows from (16) because $|B| = 1$. ∎

Corollary A.3.1. For $C$ nonsingular,

(19)  $|C + yy'| = \begin{vmatrix} C & y \\ -y' & 1 \end{vmatrix} = |C|(1 + y'C^{-1}y).$


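Theorem A.3.3. Let the nonsingular matrix $A$ be partitioned as in (2) so that
$A_{22}$ is square. If $A_{22}$ is nonsingular, let $A_{11 \cdot 2} = A_{11} - A_{12}A_{22}^{-1}A_{21}$. Then

(20)  $A^{-1} = \begin{pmatrix} A_{11 \cdot 2}^{-1} & -A_{11 \cdot 2}^{-1}A_{12}A_{22}^{-1} \\ -A_{22}^{-1}A_{21}A_{11 \cdot 2}^{-1} & A_{22}^{-1}A_{21}A_{11 \cdot 2}^{-1}A_{12}A_{22}^{-1} + A_{22}^{-1} \end{pmatrix}.$

Proof. From Theorem A.3.1,

(21)  $A = B^{-1} \begin{pmatrix} A_{11 \cdot 2} & 0 \\ 0 & A_{22} \end{pmatrix} C^{-1}.$

Hence

(22)  $A^{-1} = C \begin{pmatrix} A_{11 \cdot 2}^{-1} & 0 \\ 0 & A_{22}^{-1} \end{pmatrix} B = \begin{pmatrix} I & 0 \\ -A_{22}^{-1}A_{21} & I \end{pmatrix} \begin{pmatrix} A_{11 \cdot 2}^{-1} & 0 \\ 0 & A_{22}^{-1} \end{pmatrix} \begin{pmatrix} I & -A_{12}A_{22}^{-1} \\ 0 & I \end{pmatrix}.$

Multiplication gives the desired result. ∎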
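
The determinant identity of Theorem A.3.2 and the partitioned inverse of Theorem A.3.3 can be checked in exact rational arithmetic. The Python sketch below does so for one 4 × 4 matrix split into 2 × 2 blocks; the matrix and the helper `inverse_and_det` (a plain Gauss-Jordan elimination) are assumptions made for this illustration and are not part of the text.

```python
from fractions import Fraction

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matadd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def negate(X):
    return [[-x for x in row] for row in X]

def inverse_and_det(M):
    """Gauss-Jordan elimination with exact fractions; returns (inverse, determinant)."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        if pivot != col:                      # a row swap changes the sign of the determinant
            aug[col], aug[pivot] = aug[pivot], aug[col]
            det = -det
        scale = aug[col][col]
        det *= scale
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug], det

# Illustrative nonsingular matrix, partitioned into 2 x 2 blocks.
A = [[4, 1, 2, 0], [1, 3, 0, 1], [2, 0, 5, 1], [0, 1, 1, 2]]
A11 = [row[:2] for row in A[:2]]
A12 = [row[2:] for row in A[:2]]
A21 = [row[:2] for row in A[2:]]
A22 = [row[2:] for row in A[2:]]

A22_inv, det_A22 = inverse_and_det(A22)
A11_2 = matadd(A11, negate(matmul(matmul(A12, A22_inv), A21)))   # A_{11.2}
A11_2_inv, det_A11_2 = inverse_and_det(A11_2)
A_inv, det_A = inverse_and_det(A)

# Blocks of the inverse as given in (20).
upper_left = A11_2_inv
upper_right = negate(matmul(matmul(A11_2_inv, A12), A22_inv))
lower_left = negate(matmul(matmul(A22_inv, A21), A11_2_inv))
lower_right = matadd(matmul(matmul(lower_left, A12), negate(A22_inv)), A22_inv)

partitioned_inverse = ([l + r for l, r in zip(upper_left, upper_right)]
                       + [l + r for l, r in zip(lower_left, lower_right)])
print(det_A == det_A11_2 * det_A22)      # identity (18)
print(partitioned_inverse == A_inv)      # identity (20)
```

Exact fractions are used so that the two comparisons are true equalities rather than approximate ones; with floating-point arithmetic a tolerance would be needed.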


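Corollary A.3.2. If $x' = (x^{(1)\prime}\ x^{(2)\prime})$, then

(23)  $x'A^{-1}x = (x^{(1)} - A_{12}A_{22}^{-1}x^{(2)})'A_{11 \cdot 2}^{-1}(x^{(1)} - A_{12}A_{22}^{-1}x^{(2)}) + x^{(2)\prime}A_{22}^{-1}x^{(2)}.$

Proof. From the theorem,

(24)  $x'A^{-1}x = x^{(1)\prime}A_{11 \cdot 2}^{-1}x^{(1)} - x^{(1)\prime}A_{11 \cdot 2}^{-1}A_{12}A_{22}^{-1}x^{(2)} - x^{(2)\prime}A_{22}^{-1}A_{21}A_{11 \cdot 2}^{-1}x^{(1)} + x^{(2)\prime}(A_{22}^{-1}A_{21}A_{11 \cdot 2}^{-1}A_{12}A_{22}^{-1} + A_{22}^{-1})x^{(2)},$

which is equal to the right-hand side of (23). ∎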




Theorem A.3.4. Let the nonsingular matrix $A$ be partitioned as in (2) so that
$A_{22}$ is square. If $A_{11}$ and $A_{22}$ are nonsingular,

(25)  $(A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1} = A_{22}^{-1}A_{21}(A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1}A_{12}A_{22}^{-1} + A_{22}^{-1}.$

Proof. The lower right-hand corner of $A^{-1}$ is the right-hand side of (25)
by Theorem A.3.3 and is also the left-hand side of (25) by interchange of 1
and 2. ∎


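Theorem A.3.5. Let $U$ be $p \times m$. The conditions for $I_p - UU'$, $I_m - U'U$,
and

(26)  $\begin{pmatrix} I_p & U \\ U' & I_m \end{pmatrix}$

to be positive definite are the same.

Proof. We have

(27)  $(v'\ w') \begin{pmatrix} I_p & U \\ U' & I_m \end{pmatrix} \begin{pmatrix} v \\ w \end{pmatrix} = v'v + v'Uw + w'U'v + w'w = v'(I_p - UU')v + (U'v + w)'(U'v + w).$

The second term on the right-hand side is nonnegative and vanishes for $w = -U'v$; the first term is
positive for all $v \ne 0$ if and only if $I_p - UU'$ is positive definite. Hence (26) is positive
definite if and only if $I_p - UU'$ is positive definite. Reversing
the roles of $v$ and $w$ shows that (26) is positive definite if and only if
$I_m - U'U$ is positive definite. ∎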


A.4. SOME MISCELLANEOUS RESULTS


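Theorem A.4.1. Let $C$ be $p \times p$, positive semidefinite, and of rank $r$ $(\le p)$.
Then there is a nonsingular matrix $A$ such that

(1)  $ACA' = \begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix}.$

Proof. Since $C$ is of rank $r$, there is a $(p - r) \times p$ matrix $A_2$ such that

(2)  $A_2C = 0.$

Choose $B$ ($r \times p$) such that

(3)  $\begin{pmatrix} B \\ A_2 \end{pmatrix}$

is nonsingular. Then

(4)  $\begin{pmatrix} B \\ A_2 \end{pmatrix} C\, (B'\ A_2') = \begin{pmatrix} BC \\ 0 \end{pmatrix} (B'\ A_2') = \begin{pmatrix} BCB' & 0 \\ 0 & 0 \end{pmatrix}.$

This matrix is of rank $r$, and therefore $BCB'$ is nonsingular. By Corollary
A.1.6 there is a nonsingular matrix $D$ such that $D(BCB')D' = I_r$. Then

(5)  $A = \begin{pmatrix} DB \\ A_2 \end{pmatrix} = \begin{pmatrix} D & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} B \\ A_2 \end{pmatrix}$

is a nonsingular matrix such that (1) holds. ∎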


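Lemma A.4.1. If $E$ is $p \times p$, symmetric, and nonsingular, there is a nonsingular
matrix $F$ such that

(6)  $FEF' = \begin{pmatrix} I & 0 \\ 0 & -I \end{pmatrix},$

where the order of $I$ is the number of positive characteristic roots of $E$ and the
order of $-I$ is the number of negative characteristic roots of $E$.

Proof. From Theorem A.2.1 we know there is an orthogonal matrix $G$
such that

(7)  $GEG' = \begin{pmatrix} h_1 & 0 & \cdots & 0 \\ 0 & h_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & h_p \end{pmatrix},$

where the $h_i$ are the characteristic roots of $E$, all nonzero because $E$ is nonsingular;
number them so that $h_1, \ldots, h_q$ are positive and $h_{q+1}, \ldots, h_p$ are negative. Let

(8)  $K = \mathrm{diag}\left(1/\sqrt{h_1}, \ldots, 1/\sqrt{h_q}, 1/\sqrt{-h_{q+1}}, \ldots, 1/\sqrt{-h_p}\right).$

Then

(9)  $KGEG'K' = (KG)E(KG)' = \begin{pmatrix} I & 0 \\ 0 & -I \end{pmatrix}.$  ∎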


Corollary A.4.1. Let $C$ be $p \times p$, symmetric, and of rank $r$ $(\le p)$. Then
there is a nonsingular matrix $A$ such that

(10)  $ACA' = \begin{pmatrix} I & 0 & 0 \\ 0 & -I & 0 \\ 0 & 0 & 0 \end{pmatrix},$

where the order of $I$ is the number of positive characteristic roots of $C$ and the
order of $-I$ is the number of negative characteristic roots, the sum of the orders
being $r$.

Proof. The proof is the same as that of Theorem A.4.1 except that Lemma
A.4.1 is used instead of Corollary A.1.6. ∎

Lemma A.4.2. Let $A$ be $n \times m$ $(n > m)$ such that

(11)  $A'A = I_m.$

There exists an $n \times (n - m)$ matrix $B$ such that $(A\ B)$ is orthogonal.

Proof. Since $A$ is of rank $m$, there exists an $n \times (n - m)$ matrix $C$ such
that $(A\ C)$ is nonsingular. Take $D$ as $C - AA'C$; then $D'A = 0$. Let $E$
$[(n - m) \times (n - m)]$ be such that $E'D'DE = I$. Then $B$ can be taken as $DE$. ∎


Lemma A.4.3. Let $x$ be a vector of $n$ components. Then there exists an
orthogonal matrix $O$ such that

(12)  $Ox = \begin{pmatrix} c \\ 0 \\ \vdots \\ 0 \end{pmatrix},$

where $c = \sqrt{x'x}$.

Proof. Let the first row of $O$ be $(1/c)x'$. The other rows may be chosen in
any way to make the matrix orthogonal. ∎
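Lemma A.4.4. Let $B = (b_{ij})$ be a $p \times p$ matrix. Then

(13)  $\dfrac{\partial |B|}{\partial b_{ij}} = B_{ij}, \qquad i, j = 1, \ldots, p,$

where $B_{ij}$ is the cofactor of $b_{ij}$.

Proof. The expansion of $|B|$ by elements of the $i$th row is

(14)  $|B| = \sum_{h=1}^{p} b_{ih}B_{ih}.$

Since $B_{ih}$ does not contain $b_{ij}$, the lemma follows. ∎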


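Lemma A.4.5. Let $b_{ij} = \beta_{ij}(c_1, \ldots, c_n)$ be the $i, j$th element of a $p \times p$
matrix $B$. Then for $g = 1, \ldots, n$,

(15)  $\dfrac{\partial |B|}{\partial c_g} = \sum_{i,j=1}^{p} \dfrac{\partial |B|}{\partial b_{ij}} \cdot \dfrac{\partial \beta_{ij}(c_1, \ldots, c_n)}{\partial c_g} = \sum_{i,j=1}^{p} B_{ij} \dfrac{\partial \beta_{ij}(c_1, \ldots, c_n)}{\partial c_g}.$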
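Theorem A.4.2. If $A = A'$,

(16)  $\dfrac{\partial |A|}{\partial a_{ii}} = A_{ii},$

(17)  $\dfrac{\partial |A|}{\partial a_{ij}} = 2A_{ij}, \qquad i \ne j.$

Proof. Equation (16) follows from the expansion of $|A|$ according to
elements of the $i$th row. To prove (17) let $b_{ij} = b_{ji} = a_{ij}$, $i, j = 1, \ldots, p$, $i < j$.
Then by Lemma A.4.5,

(18)  $\dfrac{\partial |B|}{\partial a_{ij}} = B_{ij} + B_{ji}.$

Since $|A| = |B|$ and $B_{ij} = B_{ji} = A_{ij} = A_{ji}$, (17) follows. ∎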
Theorem A.4.3.

(19)  $\dfrac{\partial}{\partial x}(x'Ax) = 2Ax,$

where $\partial / \partial x$ denotes taking partial derivatives with respect to each component of $x$
and arranging the partial derivatives in a column.

Proof. Let $h$ be a column vector of as many components as $x$. Then

(20)  $(x + h)'A(x + h) = x'Ax + h'Ax + x'Ah + h'Ah = x'Ax + 2h'Ax + h'Ah.$

The partial derivative vector is the vector multiplying $h'$ in the second term
on the right. ∎
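
The derivative formula (19) can be compared with a finite-difference approximation. The Python sketch below does this for one symmetric matrix; the matrix `A`, the point `x`, the step size, and the helper `numerical_gradient` are illustrative assumptions, and symmetry of `A` is assumed because the step $h'Ax + x'Ah = 2h'Ax$ in (20) relies on it.

```python
def quadratic_form(A, x):
    """Value of x'Ax for a matrix stored as a list of rows."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def numerical_gradient(f, x, step=1e-5):
    """Central-difference approximation to the vector of partial derivatives of f at x."""
    grad = []
    for k in range(len(x)):
        up = list(x)
        down = list(x)
        up[k] += step
        down[k] -= step
        grad.append((f(up) - f(down)) / (2.0 * step))
    return grad

# Illustrative symmetric matrix and evaluation point.
A = [[2.0, 1.0, 0.0], [1.0, 3.0, -1.0], [0.0, -1.0, 4.0]]
x = [1.0, -2.0, 0.5]

analytic = [2.0 * sum(A[i][j] * x[j] for j in range(3)) for i in range(3)]  # 2Ax
numeric = numerical_gradient(lambda v: quadratic_form(A, v), x)
print(analytic)
print(numeric)
```

Because $x'Ax$ is quadratic, the central difference has no truncation error, so the two vectors agree up to floating-point rounding.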


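Definition A.4.1. Let $A = (a_{ij})$ be a $p \times m$ matrix and $B = (b_{\alpha\beta})$ be a $q \times n$
matrix. The $pq \times mn$ matrix with $a_{ij}b_{\alpha\beta}$ as the element in the $i, \alpha$th row and the
$j, \beta$th column is called the Kronecker or direct product of $A$ and $B$ and is
denoted by $A \otimes B$; that is,

(21)  $A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1m}B \\ a_{21}B & a_{22}B & \cdots & a_{2m}B \\ \vdots & \vdots & & \vdots \\ a_{p1}B & a_{p2}B & \cdots & a_{pm}B \end{pmatrix}.$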


Some properties are the following when the orders of matrices permit the
indicated operations:

(22)  $(A \otimes B)(C \otimes D) = (AC) \otimes (BD),$
(23)  $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}.$
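
A short Python sketch of the Kronecker product makes the row and column ordering concrete and checks property (22) on small integer matrices. The function name `kron` and the four matrices are assumptions chosen for illustration.

```python
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def kron(A, B):
    """Kronecker product: the row index runs over (i, alpha), the column index over (j, beta)."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

A = [[1, 2], [3, 4]]              # 2 x 2
B = [[0, 1, 2], [1, 0, 3]]        # 2 x 3
C = [[2, 0], [1, 1]]              # 2 x 2
D = [[1, 0], [2, 1], [0, 3]]      # 3 x 2

left = matmul(kron(A, B), kron(C, D))     # (A (x) B)(C (x) D), a 4 x 4 matrix
right = kron(matmul(A, C), matmul(B, D))  # (AC) (x) (BD)
print(left == right)
```

The orders must permit both $AC$ and $BD$; here $A \otimes B$ is 4 × 6 and $C \otimes D$ is 6 × 4, so the product on the left is defined.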


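Theorem A.4.4. Let the $i$th characteristic root of $A$ ($p \times p$) be $\lambda_i$ and the
corresponding characteristic vector be $x_i = (x_{1i}, \ldots, x_{pi})'$, and let the $\alpha$th root of
$B$ ($q \times q$) be $\nu_\alpha$ and the corresponding characteristic vector be $y_\alpha$, $\alpha = 1, \ldots, q$.
Then the $i, \alpha$th root of $A \otimes B$ is $\lambda_i\nu_\alpha$, and the corresponding characteristic vector
is $x_i \otimes y_\alpha = (x_{1i}y_\alpha', \ldots, x_{pi}y_\alpha')'$, $i = 1, \ldots, p$, $\alpha = 1, \ldots, q$.

Proof.

(24)  $(A \otimes B)(x_i \otimes y_\alpha) = \begin{pmatrix} a_{11}B & \cdots & a_{1p}B \\ \vdots & & \vdots \\ a_{p1}B & \cdots & a_{pp}B \end{pmatrix} \begin{pmatrix} x_{1i}y_\alpha \\ \vdots \\ x_{pi}y_\alpha \end{pmatrix} = \begin{pmatrix} \sum_j a_{1j}x_{ji}By_\alpha \\ \vdots \\ \sum_j a_{pj}x_{ji}By_\alpha \end{pmatrix} = \lambda_i\nu_\alpha \begin{pmatrix} x_{1i}y_\alpha \\ \vdots \\ x_{pi}y_\alpha \end{pmatrix}.$  ∎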


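Theorem A.4.5.

(25)  $|A \otimes B| = |A|^q |B|^p.$

Proof. The determinant of any matrix is the product of its roots; therefore

(26)  $|A \otimes B| = \prod_{i=1}^{p} \prod_{\alpha=1}^{q} \lambda_i\nu_\alpha = \left(\prod_{i=1}^{p} \lambda_i\right)^q \left(\prod_{\alpha=1}^{q} \nu_\alpha\right)^p = |A|^q |B|^p.$  ∎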

Definition A.4.2. If the $p \times m$ matrix $A = (a_1, \ldots, a_m)$, then $\text{vec}\, A =
(a_1', \ldots, a_m')'$.

Some properties of the vec operator [e.g., Magnus (1988)] are

(27)  $\text{vec}\, ABC = (C' \otimes A)\, \text{vec}\, B,$
(28)  $\text{vec}\, xy' = y \otimes x.$
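
Properties (27) and (28) can be verified on small integer examples. In the Python sketch below, `vec` stacks the columns of a matrix into one list as in Definition A.4.2; the matrices, the vectors, and the helper names are assumptions chosen for illustration.

```python
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def kron(A, B):
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def vec(A):
    """Stack the columns of A, first column on top."""
    return [A[i][j] for j in range(len(A[0])) for i in range(len(A))]

A = [[1, 2], [0, 1], [3, 1]]   # 3 x 2
B = [[2, 1, 0], [1, 0, 4]]     # 2 x 3
C = [[1, 2], [0, 1], [5, 0]]   # 3 x 2

# Property (27): vec(ABC) = (C' (x) A) vec(B).
lhs_27 = vec(matmul(matmul(A, B), C))
K = kron(transpose(C), A)      # 6 x 6
rhs_27 = [sum(k * b for k, b in zip(row, vec(B))) for row in K]

# Property (28): vec(xy') = y (x) x.
x = [1, 2, 3]
y = [4, 5]
outer = [[xi * yj for yj in y] for xi in x]     # the matrix xy'
lhs_28 = vec(outer)
rhs_28 = [yj * xi for yj in y for xi in x]      # y (x) x written out

print(lhs_27 == rhs_27, lhs_28 == rhs_28)
```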


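Theorem A.4.6. The Jacobian of the transformation $E = Y^{-1}$ (from $E$ to $Y$)
is $|Y|^{-2p}$, where $p$ is the order of $E$ and $Y$.

Proof. From $EY = I$, we have

(29)  $\left(\dfrac{\partial}{\partial\theta}E\right)Y + E\left(\dfrac{\partial}{\partial\theta}Y\right) = 0,$

where

(30)  $\left(\dfrac{\partial}{\partial\theta}E\right) = \begin{pmatrix} \dfrac{\partial e_{11}}{\partial\theta} & \cdots & \dfrac{\partial e_{1p}}{\partial\theta} \\ \vdots & & \vdots \\ \dfrac{\partial e_{p1}}{\partial\theta} & \cdots & \dfrac{\partial e_{pp}}{\partial\theta} \end{pmatrix}.$

Then

(31)  $\left(\dfrac{\partial}{\partial\theta}E\right) = -E\left(\dfrac{\partial}{\partial\theta}Y\right)E = -Y^{-1}\left(\dfrac{\partial}{\partial\theta}Y\right)Y^{-1}.$

If $\theta = y_{\alpha\beta}$, then

(32)  $\left(\dfrac{\partial}{\partial y_{\alpha\beta}}E\right) = -E\varepsilon_{\alpha\beta}E = -e_{\cdot\alpha}e_{\beta\cdot},$

where $\varepsilon_{\alpha\beta}$ is a $p \times p$ matrix with all elements 0 except the element in the
$\alpha$th row and $\beta$th column, which is 1; and $e_{\cdot\alpha}$ is the $\alpha$th column of $E$ and $e_{\beta\cdot}$
is its $\beta$th row. Thus $\partial e_{ij}/\partial y_{\alpha\beta} = -e_{i\alpha}e_{\beta j}$. Then the Jacobian is the determinant
of a $p^2 \times p^2$ matrix

(33)  $\text{mod} \left|\dfrac{\partial e_{ij}}{\partial y_{\alpha\beta}}\right| = |e_{i\alpha}e_{\beta j}| = |E \otimes E'| = |E|^p |E'|^p = |E|^{2p} = |Y|^{-2p}.$  ∎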


Theorem A.4.7. Let $A$ and $B$ be symmetric matrices with characteristic roots
$a_1 \ge a_2 \ge \cdots \ge a_p$ and $b_1 \ge b_2 \ge \cdots \ge b_p$, respectively, and let $H$ be a $p \times p$
orthogonal matrix. Then

(34)  $\max_{H} \text{tr}\, HAH'B = \sum_{j=1}^{p} a_jb_j, \qquad \min_{H} \text{tr}\, HAH'B = \sum_{j=1}^{p} a_jb_{p+1-j}.$

Proof. Let $A = H_aD_aH_a'$ and $B = H_bD_bH_b'$, where $H_a$ and $H_b$ are orthogonal
and $D_a$ and $D_b$ are diagonal with diagonal elements $a_1, \ldots, a_p$ and
$b_1, \ldots, b_p$, respectively. Then

(35)  $\max_{H^*} \text{tr}\, H^*AH^{*\prime}B = \max_{H^*} \text{tr}\, H^*H_aD_aH_a'H^{*\prime}H_bD_bH_b' = \max_{H^*} \text{tr}\, H_b'H^*H_aD_a(H_b'H^*H_a)'D_b = \max_{H} \text{tr}\, HD_aH'D_b,$

where $H = H_b'H^*H_a$. We have

(36)  $\text{tr}\, HD_aH'D_b = \sum_{i=1}^{p} (HD_aH')_{ii}b_i = \sum_{i=1}^{p-1} \sum_{j=1}^{i} (HD_aH')_{jj}(b_i - b_{i+1}) + b_p \sum_{j=1}^{p} (HD_aH')_{jj} \le \sum_{i=1}^{p-1} \sum_{j=1}^{i} a_j(b_i - b_{i+1}) + b_p \sum_{j=1}^{p} a_j$

by Lemma A.4.6 below. The minimum in (34) is treated as the negative of the
maximum with $B$ replaced by $-B$ [von Neumann (1937)]. ∎




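Lemma A.4.6. Let $P = (p_{ij})$ be a doubly stochastic matrix ($p_{ij} \ge 0$,
$\sum_{i=1}^{p} p_{ij} = 1$, $\sum_{j=1}^{p} p_{ij} = 1$). Let $y_1 \ge y_2 \ge \cdots \ge y_p$. Then

(37)  $\sum_{i=1}^{k} y_i \ge \sum_{i=1}^{k} \sum_{j=1}^{p} p_{ij}y_j, \qquad k = 1, \ldots, p.$

Proof.

(38)  $\sum_{i=1}^{k} \sum_{j=1}^{p} p_{ij}y_j = \sum_{j=1}^{p} g_jy_j,$

where $g_j = \sum_{i=1}^{k} p_{ij}$, $j = 1, \ldots, p$ ($0 \le g_j \le 1$, $\sum_{j=1}^{p} g_j = k$). Then

(39)  $\sum_{j=1}^{p} g_jy_j - \sum_{j=1}^{k} y_j = -\sum_{j=1}^{k} y_j + y_k\left(k - \sum_{j=1}^{p} g_j\right) + \sum_{j=1}^{p} g_jy_j = \sum_{j=1}^{k} (y_j - y_k)(g_j - 1) + \sum_{j=k+1}^{p} (y_j - y_k)g_j \le 0.$  ∎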


Corollary A.4.2. Let $A$ be a symmetric matrix with characteristic roots
$a_1 \ge a_2 \ge \cdots \ge a_p$. Then

(40)  $\max_{R'R = I_k} \text{tr}\, R'AR = \sum_{i=1}^{k} a_i.$

Proof. In Theorem A.4.7 let

(41)  $B = \begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix}.$  ∎
Theorem A.4.8.

(42)  $|I + xC| = 1 + x\, \text{tr}\, C + O(x^2).$

Proof. The determinant (42) is a polynomial in $x$ of degree $p$; the
coefficient of the linear term is the first derivative of the determinant
evaluated at $x = 0$. In Lemma A.4.5 let $n = 1$, $c_1 = x$, $\beta_{ih}(x) = \delta_{ih} + xc_{ih}$,
where $\delta_{ii} = 1$ and $\delta_{ih} = 0$, $i \ne h$. Then $d\beta_{ih}(x)/dx = c_{ih}$, $B_{ii} = 1$ for $x = 0$, and
$B_{ih} = 0$ for $x = 0$, $i \ne h$. Thus

(43)  $\left.\dfrac{d|B(x)|}{dx}\right|_{x=0} = \sum_{i=1}^{p} c_{ii}.$  ∎
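
The expansion (42) can be illustrated with exact arithmetic for a 3 × 3 matrix. The Python sketch below computes the remainder $|I + xC| - (1 + x\, \text{tr}\, C)$ for decreasing $x$ and divides it by $x^2$; the matrix `C` and the helper names are assumptions chosen for illustration.

```python
from fractions import Fraction

def det3(M):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def remainder(C, x):
    """|I + xC| - (1 + x tr C), computed exactly."""
    M = [[(1 if i == j else 0) + x * C[i][j] for j in range(3)] for i in range(3)]
    trace = sum(C[i][i] for i in range(3))
    return det3(M) - (1 + x * trace)

# Illustrative matrix; its trace is 6.
C = [[1, 2, 0], [0, 3, 1], [4, 0, 2]]

ratios = []
for x in (Fraction(1, 10), Fraction(1, 100), Fraction(1, 1000)):
    r = remainder(C, x)
    ratios.append(r / x ** 2)
    print(x, r, r / x ** 2)
```

For this particular `C`, direct expansion gives $|I + xC| = 1 + 6x + 11x^2 + 14x^3$, so the printed ratio equals $11 + 14x$ and stays bounded as $x \to 0$, which is what the $O(x^2)$ term in (42) asserts.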




A.5. GRAM–SCHMIDT ORTHOGONALIZATION AND THE
SOLUTION OF LINEAR EQUATIONS


A.5.1. Gram—Schmidt Orthogonalization 


The derivation of the Wishart density in Section 7.2 included the
Gram–Schmidt orthogonalization of a set of vectors; we shall review that
development here. Consider the $p$ linearly independent $n$-dimensional vectors
$v_1, \ldots, v_p$ $(p \le n)$. Define $w_1 = v_1$,

(1)  $w_i = v_i - \sum_{j=1}^{i-1} \dfrac{v_i'w_j}{w_j'w_j}\, w_j, \qquad i = 2, \ldots, p.$

Then $w_i \ne 0$, $i = 1, \ldots, p$, because $v_1, \ldots, v_p$ are linearly independent, and
$w_i'w_j = 0$, $i \ne j$, as was proved by induction in Section 7.2. Let $u_i = (1/\|w_i\|)w_i$,
$i = 1, \ldots, p$. Then $u_1, \ldots, u_p$ are orthonormal; that is, they are orthogonal and
of unit length. Let $U = (u_1, \ldots, u_p)$. Then $U'U = I$. Define $t_{ii} = \|w_i\|$ $(> 0)$,

(2)  $t_{ij} = \dfrac{v_i'w_j}{\|w_j\|} = v_i'u_j, \qquad j = 1, \ldots, i-1, \quad i = 2, \ldots, p,$

and $t_{ij} = 0$, $j = i+1, \ldots, p$, $i = 1, \ldots, p-1$. Then $T = (t_{ij})$ is a lower triangular
matrix. We can write (1) as

(3)  $v_i = \|w_i\|u_i + \sum_{j=1}^{i-1} (v_i'u_j)u_j = \sum_{j=1}^{i} t_{ij}u_j, \qquad i = 1, \ldots, p,$

that is,

(4)  $V = (v_1, \ldots, v_p) = UT'.$

Then

(5)  $A = V'V = TU'UT' = TT'$


as shown in Section 7.2. Note that if $V$ is square, we have decomposed an
arbitrary nonsingular matrix into the product of an orthogonal matrix and an
upper triangular matrix with positive diagonal elements; this is sometimes
known as the QR decomposition. The matrices $U$ and $T$ in (4) are unique.

These operations can be done in a different order. Let $V = (v_1^{(0)}, \ldots, v_p^{(0)})$.
For $k = 1, \ldots, p-1$ define recursively

(6)  $t_{kk} = \|v_k^{(k-1)}\|, \qquad u_k = \dfrac{1}{\|v_k^{(k-1)}\|}\, v_k^{(k-1)} = \dfrac{1}{t_{kk}}\, v_k^{(k-1)},$

(7)  $t_{jk} = v_j^{(k-1)\prime}u_k, \qquad j = k+1, \ldots, p,$

(8)  $v_j^{(k)} = v_j^{(k-1)} - t_{jk}u_k, \qquad j = k+1, \ldots, p.$

Finally $t_{pp} = \|v_p^{(p-1)}\|$ and $u_p = (1/t_{pp})v_p^{(p-1)}$. The same orthonormal vectors
$u_1, \ldots, u_p$ and the same triangular matrix $(t_{ij})$ are given by the two procedures.
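
The two orderings of the computation can be sketched side by side in Python. The first function follows (1) and (2), the second follows the recursion (6) to (8); the function names, the list-of-vectors representation, and the example vectors are assumptions made for this illustration, and no reordering of columns for numerical stability is attempted.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def classical_gram_schmidt(vectors):
    """Orthogonalize as in (1)-(2); returns (u_1..u_p, lower triangular T)."""
    p = len(vectors)
    ws, us = [], []
    T = [[0.0] * p for _ in range(p)]
    for i, v in enumerate(vectors):
        w = list(v)
        for w_j in ws:                                   # subtract projections on w_1..w_{i-1}
            coef = dot(v, w_j) / dot(w_j, w_j)
            w = [wk - coef * wjk for wk, wjk in zip(w, w_j)]
        T[i][i] = math.sqrt(dot(w, w))                   # t_ii = ||w_i||
        ws.append(w)
        us.append([wk / T[i][i] for wk in w])
        for j in range(i):
            T[i][j] = dot(v, us[j])                      # t_ij = v_i' u_j
    return us, T

def modified_gram_schmidt(vectors):
    """Orthogonalize as in (6)-(8): remove each new u_k from all remaining vectors."""
    p = len(vectors)
    work = [list(v) for v in vectors]
    us = []
    T = [[0.0] * p for _ in range(p)]
    for k in range(p):
        T[k][k] = math.sqrt(dot(work[k], work[k]))
        u_k = [x / T[k][k] for x in work[k]]
        us.append(u_k)
        for j in range(k + 1, p):
            T[j][k] = dot(work[j], u_k)
            work[j] = [x - T[j][k] * y for x, y in zip(work[j], u_k)]
    return us, T

# Three linearly independent vectors in four dimensions (illustrative data).
V = [[1.0, 1.0, 0.0, 1.0], [2.0, 0.0, 1.0, 1.0], [0.0, 1.0, 3.0, 1.0]]
U1, T1 = classical_gram_schmidt(V)
U2, T2 = modified_gram_schmidt(V)

# Rebuild each v_i from (3): v_i = sum_j t_ij u_j.
rebuilt = [[sum(T1[i][j] * U1[j][k] for j in range(i + 1)) for k in range(4)]
           for i in range(3)]
print(max(abs(T1[i][j] - T2[i][j]) for i in range(3) for j in range(3)))
```

In exact arithmetic the two functions return the same $U$ and $T$; in floating point they differ only by rounding, and the second ordering is commonly regarded as the less sensitive of the two.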

The numbering of the columns of $V$ is arbitrary. For numerical stability it
is usually best at any given stage to select the largest of $\|v_j^{(k-1)}\|$ to call $t_{kk}$.


Instead of constructing $w_i$ as orthogonal to $w_1, \ldots, w_{i-1}$, we can equivalently
construct it as orthogonal to $v_1, \ldots, v_{i-1}$. Let $w_1 = v_1$, and define

(9)  $w_i = v_i + \sum_{j=1}^{i-1} f_{ij}v_j,$

such that

(10)  $0 = v_h'w_i = v_h'v_i + \sum_{j=1}^{i-1} f_{ij}v_h'v_j = a_{hi} + \sum_{j=1}^{i-1} a_{hj}f_{ij}, \qquad h = 1, \ldots, i-1.$

Let $F = (f_{ij})$, where $f_{ii} = 1$ and $f_{ij} = 0$, $i < j$. Then

(11)  $W = (w_1, \ldots, w_p) = VF'.$

Let $D_t$ be the diagonal matrix with $\|w_j\| = t_{jj}$ as the $j$th diagonal element.
Then $U = WD_t^{-1} = VF'D_t^{-1}$. Comparison with $V = UT'$ shows that $F = D_tT^{-1}$.
Since $A = TT'$, we see that $FA = D_tT'$ is upper triangular. Hence $F$ is the
matrix defined in Theorem A.1.2.

There are other methods of accomplishing the QR decomposition that
may be computationally more efficient or more stable. A Householder matrix
has the form $H = I_n - 2\alpha\alpha'$, where $\alpha'\alpha = 1$, and is orthogonal and symmetric.
Such a matrix $H_1$ (i.e., a vector $\alpha$) can be selected so that the first
column of $H_1V$ has 0's in all positions except the first, which is positive. The
next matrix has the form

(12)  $H_2 = \begin{pmatrix} 1 & 0 \\ 0 & I_{n-1} \end{pmatrix} - 2 \begin{pmatrix} 0 \\ \alpha \end{pmatrix} (0\ \ \alpha') = \begin{pmatrix} 1 & 0 \\ 0 & I_{n-1} - 2\alpha\alpha' \end{pmatrix}.$

The $(n - 1)$-component vector $\alpha$ is chosen so that the second column of $H_2H_1V$
has all 0's except the first two components, the second being positive. This
process is continued until

(13)  $H_{p-1} \cdots H_2H_1V = \begin{pmatrix} T' \\ 0 \end{pmatrix},$

where $T'$ is upper triangular and 0 is $(n - p) \times p$. Let

(14)  $H' = H_1 \cdots H_{p-1} = (H^{(1)}\ H^{(2)}),$

where $H^{(1)}$ has $p$ columns. Then from (13) we obtain $V = H^{(1)}T'$. Since the
decomposition is unique, $H^{(1)} = U$.

Another procedure uses Givens matrices. A Givens matrix $G_{ij}$ is $I$ except
for the elements $g_{ii} = \cos\theta = g_{jj}$ and $g_{ij} = \sin\theta = -g_{ji}$, $i \ne j$. It is orthogonal.
Multiplication of $V$ on the left by such a matrix leaves all rows unchanged
except the $i$th and $j$th; $\theta$ can be chosen so that the $i, j$th element of $G_{ij}V$
is 0. Givens matrices $G_{21}, \ldots, G_{n1}$ can be chosen in turn so $G_{n1} \cdots G_{21}V$ has
all 0's in the first column except the first element, which is positive. Next
$G_{32}, \ldots, G_{n2}$ can be selected in turn so that when they are applied the
resulting matrix has 0's in the second column except for the first two
elements. Let

(15)  $G' = G_{21}' \cdots G_{n1}'G_{32}' \cdots G_{np}' = (G^{(1)}\ G^{(2)}).$

Then we obtain

(16)  $V = G' \begin{pmatrix} T' \\ 0 \end{pmatrix} = G^{(1)}T',$

and $G^{(1)} = U$.


A.5.2. Solution of Linear Equations 


In the computation of regression coefficients and other statistics, we need to 
solve linear equations 


(17) Ax =y, 


where A is p Xp and positive definite. One method of solution is Gaussian 
elimination of variables, or pivotal condensation. In the proof of Theorem 
A.1.2 we constructed a lower triangular matrix F with diagonal elements 1 
such that FA = A* is upper triangular. If Fy = y*, then the equation is 

(18) A*x=y"*. 

In coordinates this is

(19)  $\sum_{j=i}^{p} a^*_{ij}x_j = y^*_i.$




Let $a^{**}_{ij} = a^*_{ij}/a^*_{ii}$, $y^{**}_i = y^*_i/a^*_{ii}$, $j = i, i+1, \ldots, p$, $i = 1, \ldots, p$. Then

(20)  $x_i = y^{**}_i - \sum_{j=i+1}^{p} a^{**}_{ij}x_j;$

these equations are to be solved successively for $x_p, x_{p-1}, \ldots, x_1$. The calculation
of $FA = A^*$ is known as the forward solution, and the solution of (18) as
the backward solution.

Since $FAF' = A^*F' = D^2$ is diagonal, (20) is $A^{**}x = y^{**}$, where $A^{**} = D^{-2}A^*$
and $y^{**} = D^{-2}y^*$. Solving this equation gives

(21)  $x = A^{**-1}y^{**} = F'y^{**}.$

The computation is

(22)  $x = F_1' \cdots F_{p-1}'\, D^{-2}\, F_{p-1} \cdots F_1\, y.$

The multiplier of $y$ in (22) indicates a sequence of row operations which
yields $A^{-1}$.

The operations of the forward solution transform A to the upper triangu- 
lar matrix A*. As seen in Section A.5.1, the triangularization of a matrix can 
be done by a sequence of Householder transformations or by a sequence of 
Givens transformations. 

From $FA = A^*$, we obtain

(23)  $|A| = \prod_{i=1}^{p} a^*_{ii},$

which is the product of the diagonal elements of $A^*$, resulting from the
forward solution. We also have

(24)  $y'A^{-1}y = (Fy)'D^{-2}(Fy) = y^{*\prime}D^{-2}y^* = y^{*\prime}y^{**}.$

The forward solution gives a computation for the quadratic form which
occurs in $T^2$ and other statistics.
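
The forward and backward solutions, together with the by-products (23) and (24), can be sketched in a few lines of Python. Exact fractions are used so that the checks are equalities; the function names, the positive definite matrix `A`, and the right-hand side `y` are assumptions chosen for illustration, and no pivoting is done because the pivots of a positive definite matrix are positive.

```python
from fractions import Fraction

def forward_solution(A, y):
    """Row-reduce A to upper triangular A* and y to y* (the effect of multiplying by F)."""
    p = len(A)
    A_star = [[Fraction(v) for v in row] for row in A]
    y_star = [Fraction(v) for v in y]
    for k in range(p - 1):
        for i in range(k + 1, p):
            factor = A_star[i][k] / A_star[k][k]
            A_star[i] = [a - factor * b for a, b in zip(A_star[i], A_star[k])]
            y_star[i] -= factor * y_star[k]
    return A_star, y_star

def backward_solution(A_star, y_star):
    """Solve A* x = y* for x_p, x_{p-1}, ..., x_1 as in (20)."""
    p = len(A_star)
    x = [Fraction(0)] * p
    for i in range(p - 1, -1, -1):
        tail = sum(A_star[i][j] * x[j] for j in range(i + 1, p))
        x[i] = (y_star[i] - tail) / A_star[i][i]
    return x

# Illustrative positive definite matrix and right-hand side.
A = [[4, 2, 2], [2, 5, 3], [2, 3, 6]]
y = [2, 1, 3]

A_star, y_star = forward_solution(A, y)
x = backward_solution(A_star, y_star)

determinant = Fraction(1)
for i in range(3):
    determinant *= A_star[i][i]                     # (23): product of the diagonal of A*

quadratic_form = sum(y_star[i] ** 2 / A_star[i][i] for i in range(3))   # (24): y*' y**
print(x, determinant, quadratic_form)
```

Since $x = A^{-1}y$, the quadratic form $y'A^{-1}y$ must equal $y'x$, which gives an independent check on the value obtained from the forward solution alone.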
For more on matrix computations consult Golub and Van Loan (1989).


APPENDIX B 


Tables 


ow oN wvWth WN 


—_ —_ _ 
wah 


20 


1.295 
1.109 
1.058 
1.036 
1.025 


1.018 
1.014 
1.011 
1.009 
1.007 


1.005 
1.003 
1.002 
1.001 
1.000 
1.000 


12.5916 21.0261 


TABLE B.1
WILKS' LIKELIHOOD CRITERION: FACTORS $C(p, m, M)$
TO ADJUST TO $\chi^2_{pm}$, WHERE $M = n - p + 1$


1.535 1.632 
1.241 1.302 
1.145 1.190 
1.099 1.133 
1.072 1.100 
1.056 1.078 
1.044 1.063 
1.036 1.052 
1.030 1.043 
1.025 1.037 
1.019 1.028 
1.013 1.020 
1.008 1,012 
1,004 1.006 
1,001 1.002 
1,000 1,000 


5% Significance Level 
p=3 


10 


1.716 
1.359 
1.232 
1.167 
1.127 


1.101 
1.082 
1.068 
1,058 
1.050 


1.038 
1.027 
1.017 
1,009 
1,002 
1.000 


1.791 
1.410 
1,272 
1.199 
1.154 


1.123 
1,101 
1.085 
1.073 
1.063 


1.048 
1.035 
1.022 
1.011 
1.003 
1.000 


1.059 
1.043 
1.028 
1.015 
1.004 
1.000 


1.018 
1.006 
1,000 


28.8693 36.4150 43.7730 50.9985 58.1240 65.1708 


An Introduction to Multivariate Statistical Analysis, Third Edition. 
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc. 


By T. W. Anderson 


1.081 
1.060 
1.040 
1.021 
1.007 
1,000 


72.1532 




TABLE B.1 (Continued) 


5% Significance Level 


12 


1 2.021 1.407 1.451 1.517 1.583 1.644 1.700 1.751 
2 1.580 1.616 1.161 1.194 1.240 1.286 1.331 1.373 1.413 
3 1.408 1.438 1.089 1.114 1.148 1.183 1.218 1.252 1.284 
4 1.313 1.338 1.057 1.076 1.102 1.130 1.159 1.186 1.213 
5 1.251 1.040 1.055 1.076 1.099 1.122 1.145 1.168 
6 1.208 1.030 1.042 1.059 1.078 1.097 1.118 1.137 
7 1.176 1.193 1.023 1.033 1.047 1.063 1.080 1.097 1.115 
8 1.151 1.167 1.018 1.027 1.038 1.052 1.067 1.082 1.097 
9 1.132 1.147 1.015 1.022 1.032 1.044 1.057 1.070 1.084 
10 1.116 1.129 1,012 1.018 1,027 1.038 1.049 1.061 1.073 
12 1.092 1.009 1.014 1.020 1.029 1.038 1.047 1.058 
15 1.069 1.078 1,006 1.009 1.014 1.020 1.027 1.035 1.042 
20 1.046 1.052 1.003 1.006 1.009 1.013 1.017 1.022 1.027 
30 1.025 1.029 1.002 1.003 1,004 1.006 1.009 1,011 1,014 
60 1.008 1.009 1.000 1.001 1.001 1.002 1.003 1,003 1.004 
00 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 
xia 79.0819 85.9649) 15.5073 26.2962 36.4150 46.1943 55.7585 65.1708 74.4683 


TABLE B.1 (Continued)


5% Significance Level 
p= 5 


a“ 


1 1.799 1.843 1.884 1.503 1.483 1.514 1.556 1.€00 1.643 
2 1.450 1.485 1,518 1.209 1.216 1.245 1.280 1.315 1.350 
3 1.314 1.343 1.371 1.120 1,130 1.154 1.182 1211 1.240 
4 1.239 1.264 1.288 1.079 1.089 1.108 1.131 1.155 1.179 
5 1.190 1.212 1.233 1.056 1.065 1.081 1.100 1,120 1.141 
6 1.157 1.176 1,194 1.042 1.050 1.063 1.079 1.097 1.114 
7 1.132 1.149 1.165 1.033 1.040 1.051 1.065 1.080 1.095 
8 1.113 1.128 1.143 1.026 1.032 1.042 1.054 1.067 1,081 
9 1.098 1.111 1.125 1.022 1.027 1.035 1.046 1.057 1.070 
10 1.086 1.098 1.110 1.018 1,023 1.030 1.039 1.050 1.061 
12 1.068 1.078 1.088 1.013 1.017 1.023 1.030 1.038 1.047 
15 1.050 1.058 1.066 1.009 1.011 1.016 1.021 1.028 1.034 
20 1.033 1.039 1.045 1.005 1.007 1.010 1.013 1.018 1.022 
30 1.018 1.021 1.024 1.002 1.003 1.005 1.007 1.009 1.012 
60 1.005 1.007 1.008 1.001 1.001 1.001 1.002 1.003 1.004 
a) 1,000 1,000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 
Xen 83.6753 92,8083 101.879 | 18.3070 31.4104 43.7730 55.7585 67.5048 79.0819 


= 
a“ 
3 


OoOumxa A VEWN 


= 


_— 
wmv 


8Ss8 


= 
a“ 


ooawe~ wn wWtwh re 


_ = 
Ww 


p=5 
14 16 
1.683 1.722 
1,383 1.415 
1.267 1.294 
1.203 1.226 
1.161 1.181 
1.132 1.150 
1.111 1.127 
1.095 1,109 
1.082 1.095 
1.072 1.083: 
1.057 1.066 
1.042 1.049 
1.027 1,033 
1.014 1.018 
1.004 1.006 
1.000 1.000 


90.5312 101.879 


TABLE B.1 


2 


1.587 
1,254 
1.150 
1.100 
1.072 


1.055 
1.043 
1.035 
1.029 
1.024 


1.018 
1.012 
1.007 
1.003 
1.001 
1,000 


21.0261 


5% Significance Level 


p=6 
6 8 10 
1,520 1,543 1.573 
1.255 1.279 1.307 
1.163 1.184 1.208 
1,116 1.134 1.154 
1.088 1.103 1.120 
1.069 1.082 1.097 
1.056 1.068 1.081 
1.046 1.057 1.068 
1.039 1,048 1.059 
1,034 1.042 1.051 
1.025 1.032 1.040 
1.018 1.023 1.029 
1.011 1,014 1.018 
1.006 1.007 1.010 
1,002 1.002 1.003 
1,000 1.000 1.000 
50.9985 65.1708 79.0819 


(Continued) 


TABLE B.1 (Continued) 


5% Significance Level


1.530 


1.266 1.282 
1.173 1.189 
1.124 1.139 


1.095 


1.075 


1.062 1.071 
1.051 1.060 
1.043 1.051 


1.037 
1.029 


1.044 


1.020 1.024 
1.013 1.016 
1.006 1.008 
1.002 1.002 
1.000 1.000 


§8.1240 74.4683 


1.303 
1,208 
1.155 


1,083 
1.070 
1.060 


1.031 
1.019 
1.010 
1.003 
1.000 


90,5312 | 26.2962 


92.8083 


p=7 
2 4 
1.66? 1 550 
1.297 1.263 
1.178 1.165 
1.121 1.116 
1,089 1.087 
1.068 1.068 
1.054 1.055 
1.044 1.045 
1.036 1.038 
1.031 1.032 
1.023 1.024 
1.016 1.017 
1.010 1.011 
1.005 1,005 
1.001 1,001 
1,000 1.000 


23.6848 41.3371 


p=9 alle =10 
2 4 6 2 

1.791 1614 1558 | 1.847 
1373 1309 1.293 | 1,408 
1.232 1.201 1196 | 1.257 
1.162 1.144 1.144 | 1.182 
1.121) 81110 1.112 | 1,137 
1,094 1088 1.090 | 1.107 
1.076 1.071 1.074 | 1.087 
1.062 1.060 1.062 | 1.072 
1.052 1.050 1,053 | L061 
1.045 1.043 1.046 | 1.052 
1034 1033 1.035 | 1.039 
1023 1,023 1.025 | 1028 
1014 1.015 1016 |; 1.017 
1.007. 1007 1.008 | 1.009 
1.002 1.002 1002 | 1.002 
1.000 1000 1000 | 1.000 

83.6753 | 28.8693 50.9985 72.1532) 31.4104 




TABLE B.1 (Continued) 
1% Significance Level 
p=3 

4 6 8 10 12 14 16 

1,514 1.649 1.763 1.862 1.949 2.026 2.095 
1.207 1.282 1.350 1.413 1.470 1.523 1.571 
1,116 1.167 1.216 1.262 1.306 1.346 1,384 
1.076 1.113 1.150 1.187 1.22] 1.254 1.285 
1.054 1.082 1.112 1.141 1.170 1,198 1,224 


1.040 1.063 1.087 1.112 1,136 1.159 1.182 
1.031 1.050 1.070 1.091 1.111 1.132 1.152 
1.025 1.041 1.058 1.075 1.093 1.111 1.129 
1.021 1.034 1,048 1.064 1.080 1,095 1.111 
1.017 1.028 1.041 1.055 1,069 1,082 1.097 


1.012 1.021 1.031 1.042 1.053 1.064 1.076 
1.009 1.014 1.021 1.030 1,038 1.047 1.056 
1.005 1.009 1.013 1.019 1.024 1.030 1.036 
1.002 1.004 1.007 1.009 1.012 1.016 1.019 
60 1.000 1.001 1.001 1.002 1.003 1.004 1.005 1.006 
oo 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 


Non 16.8119 26.2170 34.8053 42.9798 50.8922 58.6192 66.2062 73.6826 


TABLE B.1 (Continued) 


1% Significance Level 


1 2.158 2.216 2.269 1.490 1,550 1.628 1.704 1.774 
2 1.616 1,657 1,696 1,192 1,229 1,279 1.330 1.379 
3 1.420 1.453 1.485 1.106 1.132 1.168 1.207 1.244 
4 1.315 1,344 1.371 1.068 1,088 1,115 1.146 1.176 
5 1,249 1,274 1,297 1.047 1.063 1.085 1.109 1,134 
6 1,204 1,226 1,246 1.035 1.048 1.066 1.086 1.107 
7 1,171 1,190 1,209 1.027 1.037 1,052 1.070 1,088 
8 1.146 1,163 1,180 1.021 1.030 1,043 1.058 1,073 
9 1.127 1.142 1.157 1.017 1.025 1.036 1.048 1.062 
10 1.111 1.125 1.139 1,014 1.021 1.030 1.041 1.054 
12 1.087 1.099 1.116 1.010 1.015 1.023 1.031 1.041 
15 1.065 1.074 1.083 1.007 1.010 1.016 1.022 1.029 
20 1.043 1.049 1.056 1,004 1,006 1,010 1.014 1.019 
30 1.023 1.027 1.031 1.002 1.003 1.005 1.007 1.009 
60 1.007 1.009 1.010 1.000 1.001 1.001 1.002 1.003 
00 1.000 1,000 1,000 1.000 1.000 1.000 1.000 1.000 
Xi 81.0688 88.3794 95.6257 | 20.0902 31.9999 42.9798 53.4858 63.6907 




TABLE B.1 (Continued) 
1% Significance Level 


1.838 1.896 1.949 1.999 2.045 1.606 1.589 1.625 
1.424 1.467 1.507 1.545 1.580 1.248 1.253 1.284 
1.280 1.314 1.347 1.378 1.406 1.141 1,150 1.175 
1.205 1.234 1.261 1.287 1.313 1,092 1,101 1.121 
1.159 1.183 1.207 1.230 1.252 1.065 1.074 1.050 


1.128 1.149 1.169 1.189 1.208 1.049 1.056 1.070 
1.106 1.124 1.142 1.160 1.177 1.038 1.044 1.056 
1.089 1.105 1.121 1.137 1.153 1.031 1.036 1.046 

9 1.076 1.091 1.105 1.119 1.133 1.025 1,030 1.039 
10 1.066 1,079 1.092 1.105 1.118 1.021 1,025 1.033 


i 2 oad wa Fw Ne 


12 1.051 1.062 1.073 1.083 1,094 1.015 1.019 1.025 
15 1.097 1.045 1.053 1.062 1.071 1.010 1.013 1.017 
20 1.024 1.029 1.035 1.041 1.047 1.006 1.008 1.011 
30 1.012 1.015 1.019 1.022 1.026 1.003 1,004 1.005 
60 1.004 1.005 1.006 1.007 1.008 1.001 1,001 1.001 
00 1.000 1.000 1.000 1,000 1.000 1,000 1.000 1.000 


Xa 73.6826 83.5134 93.2168 102.8168 112.3292 | 23.2093 37.5662 50.8922 


1.672 1.721 1.768 1.813 1,855 
1.321 1.359 1.396 1,431 1.465 
1.204 1.235 1.265 1.294 1,323 
1.145 1.171 1.196 1.221 1.245 
1.110 1,153 1.174 1.196 


1.087 1,124 1.143 
1.071 1.087 1,103 1.119 1.136 
, 1.059 1.073 1.087 1.102 1.116 
1.050 1.062 1.075 1.088 1.101 


1.707 1.631 1.656 
1.300 1.294 1,319 
1.175 1,183 1.205 
1.116 1.129 1.148 
1,084 1.097 1,113 


1,063 1.076 1,090 
1.050 1.061 ~—-1.074 
1.040 1.051 1,062 
1,033 1,043 ~—«:1.052 


wow wr aH vA Fw N = 


10 1.043 1.065 1,077 1.028 1.037 1.045 
12 1.033 1.041 1,051 1.060 1.070 | 1.021 1,028 1,035 
15 1.023 1,030 1.037 1,044 1.052 1,014 1.020 1.024 
20 1.015 1.019 1.024 1.029 1.034 | 1.008 1,012 1.015 
30 1,007 1.010 1.012 1.015 1.019 1.004 1.006 1,008 
60 1.002 1,003 1.004 1.005 1.006 | 1.001 1.002 1.002 
ro) 1.000 1.000 1,000 1,000 1.000 1.000 1.000 1.000 
Koei 63,6907 76.1539 88,3794 100.425: 112.329) | 26.2170 58.6192 73.6826 




TABLE B.1 (Continued) 


1% Significance Level 


“ 


1 1.797 1.667 1.642 1.648 1.666 
2 1.348 1.305 1,306 =: 1.321 1.342 
3 1.207 1.188 1.194 1.210 1,229 
4 1.140 1.130 1.138 1,152 1.169 
5 1.102 1.097 1.105 1.117 1.132 
6 1,078 1.076 1.083 1.094 1.107 
7 1.062 1.061 1.067 1.077 1,089 
8 1.050 1,050 1.056 1.065 1.075 
9 1.042 1.042 1.047 1,055 1.065 
10 1.035 1.036 1.041 1,048 1,056 
12 1.026 1.027 1.031 1.037 1.044 
15 1.018 1,019 1022 4.025 1,032 
20 1.011 1.012 1.014 1.017 1.020 
30 1.005 1.006 1.007 1.009 1,011 
60 1,003 1.004 1.001 1,002 1.002 1.003 1.003 
oO 1,000 1.000 1,000 1.000 1,000 1.000 1.000 
Xen 88.3794 102.816 | 29.1412 482782 66.2062 83.5134 100.425 


1% Significance Level 


p=9 p= 
2 4 6 2 


1.953 1,740 1,671 2.021 
1.436 1.355 1.333 1.476 
1,267 1.226 1.218 1.296 
1.185 1.161 1.158 1.207 
1.138 1.122 1.122 1.155 


1.107. :1.096 1,098 1,121 
1,086 1.078 1.080 1.098 
1.070 1.065 1.067 1.081 
1.059 1.055 1.058 1.068 
1.050 1.047 1.050 1.058 


1.038 1.036 1.037 1.044 
1.026 1.026 1.027 1.031 
1.016 1.016 1.017 1.019 
1,008 1.008 1.009 1.010 
1,002 1,002 1.003 1.003 
1.000 1.000 1.000 1.000 


31.9999 93.2168 | 34.8053 58.6192 81.0688 | 37 5662 


SOURIS Uk WN 




TABLE B.2 
TABLES OF SIGNIFICANCE POINTS FOR THE LAWLEY–HOTELLING TRACE TEST

$\Pr\left\{\dfrac{n}{m}\, W \ge x_\alpha\right\} = \alpha$


5% Significance Level 


p=2
n\m 2 3 4 5 6 8 10 12 15
2 | 10.659* 11.373* 11.562* 11.804* 11.952* 12.052* 12.153*
3 | 58.428 58.915 59.161 59.308 59.407 59.531 59.606 59.655 59.705
4 | 23.999 23.312 22.918 22.663 22.484 22.250 22.104 22.003 21.901
5 | 15.639 14.864 14.422 14.135 13.934 13.670 13.504 13.391 13.275
6 | 12.175 11.411 10.975 10.691 10.491 10.228 10.063 9.949 9.832
7 | 10.334 9.594 9.169 8.893 8.697 8.440 8.277 8.164 8.048
8 | 9.207 8.488 8.075 7.805 7.614 7.361 7.201 7.090 6.975
10 | 7.909 7.224 6.829 6.570 6.386 6.141 5.984 5.875 5.761
12 | 7.190 6.528 6.146 5.894 5.715 5.474 5.320 5.212 5.100
14 | 6.735 6.090 5.717 5.470 5.294 5.057 4.905 4.798 4.686
18 | 6.193 5.571 5.209 4.970 4.798 4.566 4.416 4.309 4.198
20 | 6.019 5.405 5.047 4.810 4.640 4.410 4.260 4.154 4.042
25 | 5.724 5.124 4.774 4.542 4.374 4.147 3.998 3.892 3.780
30 | 5.540 4.949 4.604 4.374 4.209 3.983 3.835 3.729 3.617
35 | 5.414 4.829 4.488 4.260 4.096 3.872 3.724 3.618 3.505
40 | 5.322 4.742 4.404 4.178 4.014 3.791 3.643 3.538 3.425
50 | 5.198 4.625 4.290 4.066 3.904 3.682 3.535 3.429 3.315
60 | 5.118 4.549 4.217 3.994 3.833 3.611 3.465 3.359 3.245
70 | 5.062 4.496 4.165 3.944 3.783 3.562 3.416 3.310 3.196
80 | 5.020 4.457 4.127 3.907 3.747 3.526 3.380 3.274 3.159
100 | 4.963 4.403 4.075 3.856 3.696 3.476 3.330 3.224 3.109
200 | 4.851 4.298 3.974 3.757 3.598 3.380 3.234 3.127 3.012
∞ | 4.744 4.197 3.877 3.661 3.504 3.287 3.141 3.035 2.918
*Multiply by 10².
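
The last row of the table above (the limit as n grows without bound) can be checked against chi-square points. For large n, nW is approximately chi-square with pm degrees of freedom; this is a standard large-sample result, quoted here as background rather than read off the table. The limiting significance point of (n/m)W is therefore the upper α point of chi-square with pm degrees of freedom, divided by m. The sketch below is an illustration under that assumption. The function names are hypothetical; it evaluates the chi-square distribution function by a power series and inverts it by bisection.

```python
import math

def chi2_cdf(x, df):
    """P(X <= x) for X ~ chi-square(df), via the power series of the
    regularized lower incomplete gamma function P(df/2, x/2)."""
    if x <= 0:
        return 0.0
    a, t = df / 2.0, x / 2.0
    term = total = 1.0 / a
    k = 0
    while term > 1e-17 * total:
        k += 1
        term *= t / (a + k)       # next term t^k / (a (a+1) ... (a+k))
        total += term
    return total * math.exp(a * math.log(t) - t - math.lgamma(a))

def chi2_point(alpha, df):
    """Upper alpha significance point of chi-square(df), by bisection."""
    lo, hi = 0.0, 1.0
    while 1.0 - chi2_cdf(hi, df) > alpha:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if 1.0 - chi2_cdf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def limiting_point(alpha, p, m):
    """Large-n limit of the significance point of (n/m) W."""
    return chi2_point(alpha, p * m) / m

# Compare with the last row of the 5% table for p = 2.
for m in (2, 3, 4, 5, 6, 8, 10, 12, 15):
    print(m, round(limiting_point(0.05, 2, m), 3))
```

For finite n the tabulated points are larger than this limit, so the chi-square value is only a rough guide when n is small.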




TABLE B.2 (Continued) 


1% Significance Level
p=2
n\m 2 3 4 5 6 8 10 12 15
2 | 2.467† 2.667† 2.776† 2.844† 2.891† 2.952† 2.989† 3.014† 3.039†
3 | 2.985* 2.990* 2.992* 2.994* 2.995* 2.996* 2.997* 2.997* 2.998*
4 | 74.275 71.026 69.244 68.116 67.337 66.332 65.712 65.290 64.862
5 | 38.295 35.567 34.070 33.121 32.465 31.615 31.088 30.729 30.364
6 | 26.118 23.794 22.517 21.706 21.143 20.413 19.958 19.648 19.332
7 | 20.388 18.326 17.191 16.469 15.967 15.313 14.905 14.626 14.341
8 | 17.152 15.268 14.229 13.567 13.106 12.504 12.127 11.868 11.603
10 | 13.701 12.038 11.120 10.531 10.121 9.582 9.243 9.01) 8.769
12 | 11.920 10.388 9.541 8.996 8.615 8.113 7.796 7.577 7.351
14 | 10.844 9.399 8.597 8.082 7.720 7.242 6.939 6.729 6.511
18 | 9.617 8.278 7.533 7.053 6.714 6.265 5.979 5.780 5.572
20 | 9.236 7.932 7.206 6.736 6.406 5.966 5.685 5.489 5.284
25 | 8.604 7.360 6.666 6.217 5.899 5.476 5.204 5.013 4.813
30 | 8.219 7.013 6.339 5.903 5.593 5.180 4.914 4.726 4.529
35 | 7.959 6.780 6.120 5.692 5.389 4.982 4.720 4.535 4.339
40 | 7.773 6.613 5.964 5.542 5.243 4.841 4.582 4.398 4.204
50 | 7.523 6.389 5.754 5.341 5.048 4.653 4.397 4.216 4.023
60 | 7.363 6.247 5.621 5.214 4.924 4.534 4.280 4.100 3.908
70 | 7.252 6.148 5.529 5.125 4.838 4.451 4.199 4.020 3.829
80 | 7.171 6.075 5.461 5.061 4.775 4.391 4.140 3.961 3.770
100 | 7.059 5.976 5.369 4.972 4.690 4.308 4.059 3.881 3.691
200 | 6.843 5.785 5.191 4.803 4.525 4.150 3.903 3.727 3.538
∞ | 6.638 5.604 5.023 4.642 4.369 4.000 3.757 3.582 3.393

†Multiply by 10⁴.
*Multiply by 10².




TABLE B.2 (Continued)
5% Significance Level

p=3
n\m 3 4 5 6 8 10 12 15 20
3 | 25.930* 26.996* 27.665* 28.125* 28.712* 29.073* 29.316* 29.561* 29.809*
4 | 1.188* 1.193* 1.196* 1.198* 1.200* 1.202* 1.203* 1.204* 1.205*
5 | 42.474 41.764 41.305 40.983 40.562 40.300 40.120 39.937 39.750
6 | 25.456 24.715 24.235 23.899 23.458 23.182 22.992 22.799 22.600
7 | 18.752 18.056 17.605 17.288 16.870 16.608 16.427 16.241 16.051
8 | 15.308 14.657 14.233 13.934 13.540 13.290 13.118 12.941 12.758
10 | 11.893 11.306 10.921 10.649 10.287 10.057 9.897 9.732 9.560
12 | 10.229 9.682 9.323 9.068 8.727 8.509 8.357 8.198 8.033
14 | 9.255 8.736 8.394 8.149 7.822 7.612 7.465 7.311 7.150
16 | 8.618 8.118 7.788 7.553 7.236 7.031 6.887 6.736 6.577
18 | 8.170 7.685 7.364 7.135 6.825 6.624 6.483 6.334 6.177
20 | 7.838 7.365 7.051 6.826 6.522 6.325 6.185 6.038 5.882
25 | 7.294 6.841 6.539 6.323 6.029 5.837 5.700 5.556 5.401
30 | 6.965 6.524 6.231 6.020 5.732 5.543 5.409 5.265 5.112
35 | 6.745 6.313 6.025 5.818 5.534 5.348 5.214 5.072 4.919
40 | 6.588 6.162 5.878 5.673 5.393 5.208 5.076 4.934 4.781
50 | 6.377 5.961 5.682 5.481 5.205 5.022 4.891 4.750 4.597
60 | 6.243 5.832 5.558 5.359 5.086 4.904 4.774 4.633 4.480
70 | 6.150 5.744 5.471 5.274 5.003 4.823 4.693 4.553 4.399
80 | 6.082 5.679 5.408 5.212 4.943 4.763 4.634 4.493 4.339
100 | 5.989 5.590 5.322 5.128 4.860 4.682 4.552 4.413 4.258
200 | 5.810 5.419 5.156 4.965 4.702 4.525 4.397 4.257 4.102
∞ | 5.640 5.256 4.999 4.812 4.552 4.377 4.250 4.110 3.954
*Multiply by 10².




TABLE B.2 (Continued)

1% Significance Level

p=3
n\m 3 4 5 6 8 10 12 15 20
3 | 6.484† 6.750† 6.917† 7.031† 7.178† 7.267† 7.328† 7.389† 7.451†
4 | 5.990* 5.995* 5.998* 6.000* 6.002* 6.003* 6.005* 6.006* 6.007*
5 | 1.274* 1.242* 1.222* 1.208* 1.190* 1.179* 1.172* 1.164* 1.156*
6 | 59.507 57.032 55.462 54.377 52.973 52.102 51.509 50.906 50.292
7 | 37.994 35.993 34.721 33.840 32.695 31.984 31.498 31.002 30.496
8 | 28.308 26.599 25.511 24.755 23.771 23.157 22.737 22.308 21.868
10 | 19.737 18.355 17.471 16.855 16.050 15.544 15.197 14.840 14.472
12 | 15.973 14.765 13.990 13.448 12.737 12.288 11.978 11.659 11.328
14 | 13.905 12.803 12.096 11.599 10.945 10.530 10.243 9.946 9.638
16 | 12.610 11.581 10.918 10.452 9.836 9.444 9.172 8.890 8.596
18 | 11.729 10.751 10.120 9.676 9.087 8.712 8.450 8.178 7.893
20 | 11.091 10.152 9.545 9.117 8.549 8.186 7.932 7.668 7.390
25 | 10.075 9.201 8.634 8.233 7.699 7.356 TALS 6.803 6.596
30 | 9.479 8.644 8.102 7.718 7.205 6.874 6.641 6.395 6.135
35 | 9.087 8.280 7.755 7.382 6.883 6.560 6.332 6.091 5.834
40 | 8.811 8.023 7.511 7.146 6.650 6.339 6.115 5.877 5.623
50 | 8.448 7.686 7.189 6.836 6.360 6.050 5.831 5.597 5.346
60 | 8.220 7.474 6.988 6.642 6.174 5.870 5.653 5.422 5.172
70 | 8.063 7.329 6.850 6.509 6.047 5.746 5.531 5.302 5.053
80 | 7.948 7.224 6.750 6.412 5.955 5.656 5.443 5.215 4.967
100 | 7.793 7.081 6.614 6.281 5.830 5.534 5.323 5.096 4.850
200 | 7.498 6.808 6.356 6.032 5.593 5.304 5.096 4.873 4.627
∞ | 7.222 6.554 6.116 5.801 5.373 5.089 4.885 4.664 4.419

†Multiply by 10⁴.
*Multiply by 10².


TABLE B.2 (Continued)


5% Significance Level
p=4
n\m 4 5 6 8 10 12 15 20 25
4 | 49.964* 51.204* 52.054* 53.142* 53.808* 54.258* 54.71* 55.17* 55.46*
5 | 1.996* 2.001* 2.005* 2.009* 2.011* 2.013* 2.015* 2.016* 2.017*
6 | 65.715 64.999 64.497 63.841 63.432 63.151 62.866 62.573 62.396
7 | 37.343 36.629 36.129 35.474 35.064 34.782 34.495 34.200 34.019
8 | 26.516 25.868 25.413 24.814 24.437 24.178 23.912 23.639 23.471
10 | 17.875 17.326 16.938 16.424 16.098 15.872 15.640 15.399 15.250
12 | 14.338 13.848 13.500 13.037 12.741 12.535 12.321 12.099 11.961
14 | 12.455 12.002 11.680 11.248 10.972 10.778 10.577 10.366 10.234
16 | 11.295 10.868 10.563 10.154 9.890 9.705 9.512 9.309 9.181
18 | 10.512 10.104 9.812 9.419 9.165 8.986 8.798 8.600 8.475
20 | 9.950 9.556 9.274 8.893 8.645 8.471 8.287 8.093 7.970
25 | 9.059 8.688 8.422 8.062 7.826 7.659 7.482 7.293 7.173
30 | 8.538 8.182 7.927 7.578 7.350 7.188 7.015 6.829 6.710
35 | 8.197 7.852 7.603 7.263 7.040 6.880 6.710 6.526 6.408
40 | 7.957 7.619 7.375 7.041 6.821 6.664 6.495 6.313 6.195
50 | 7.640 7.313 7.075 6.750 6.535 6.380 6.214 6.033 5.916
60 | 7.442 7.120 6.887 6.568 6.356 6.203 6.038 5.858 5.740
70 | 7.305 6.988 6.758 6.443 6.232 6.081 5.917 5.738 5.620
80 | 7.206 6.892 6.665 6.351 6.143 5.992 5.829 5.650 5.532
100 | 7.071 6.762 6.537 6.228 6.021 5.872 5.710 5.531 5.413
200 | 6.814 6.514 6.295 5.993 5.791 5.644 5.484 5.305 5.186
∞ | 6.574 6.282 6.069 5.774 5.576 5.431 5.272 5.094 4.974


*Multiply by 10².




TABLE B.2 (Continued)


1% Significance Level
p=4
n\m 4 5 6 8 10 12 15 20 25
4 | 12.491† 12.800† 13.012† 13.283† 13.449† 13.561† 13.67† 13.79† 13.87†
5 | 9.999* 10.004* 10.008* 10.012* 10.014* 10.016* 10.018* 10.02* 10.02*
6 | 1.938* 1.906* 1.885* 1.857* 1.840* 1.828* 1.816* 1.804* 1.797*
7 | 85.053 82.731 81.125 79.047 77.759 76.882 75.989 75.082 74.522
8 | 51.991 50.178 48.921 47.290 46.276 45.583 44.877 44.156 43.715
10 | 29.789 28.478 27.566 26.376 25.632 25.121 24.597 24.060 23.731
12 | 21.965 20.889 20.138 19.154 18.534 18.108 17.668 17.215 16.936
14 | 18.142 17.199 16.539 15.670 15.121 14.742 14.349 13.943 13.691
16 | 15.916 15.059 14.457 13.662 13.157 12.807 12.444 12.066 11.831
18 | 14.473 13.674 13.112 12.368 11.894 11.564 11.221 10.863 10.639
20 | 13.466 12.710 12.177 11.470 11.018 10.703 10.374 10.030 9.814
25 | 11.924 11.237 10.751 10.103 9.687 9.395 9.089 8.766 8.562
30 | 11.055 10.409 9.951 9.338 8.943 8.665 8.372 8.060 7.863
35 | 10.499 9.880 9.440 8.851 8.470 8.200 7.915 7.611 7.418
40 | 10.114 9.514 9.087 8.514 8.142 7.879 7.600 7.301 7.110
50 | 9.614 9.040 8.631 8.079 7.720 7.465 7.194 6.902 6.713
60 | 9.305 8.747 8.319 7.811 7.460 7.210 6.943 6.655 6.468
70 | 9.095 8.549 8.158 7.630 7.284 7.037 6.774 6.488 6.301
80 | 8.944 8.405 8.020 7.498 7.157 6.912 6.651 6.367 6.181
100 | 8.739 8.211 7.833 7.321 6.985 6.744 6.486 6.204 6.019
200 | 8.354 7.848 7.484 6.990 6.664 6.429 6.176 5.898 5.714
∞ | 8.000 7.513 7.163 6.686 6.369 6.140 5.892 5.616 5.432
†Multiply by 10⁴.
*Multiply by 10².




TABLE B.2 (Continued)


5% Significance Level
p=5
n\m 5 6 8 10 12 15
5 | 81.991* 83.352* 85.093* 86.160† 86.88†
6 | 3.009* 3.014* 3.020* 3.024 3.027 3.029†
7 | 93.762 93.042 92.102 91.515 91.113 90.705
8 | 51.339 50.646 49.739 49.170 48.780 48.382
10 | 27.667 27.115 26.387 25.927 25.610 25.284
12 | 20.169 19.701 19.079 18.683 18.409 18.124
14 | 16.643 16.224 15.666 15.309 15.059 14.800
16 | 14.624 14.239 13.722 13.389 13.157 12.914
18 | 13.326 12.963 12.476 12.161 11.939 11.708
20 | 12.424 12.078 11.612 11.310 11.097 10.874
25 | 11.046 10.728 10.297 10.016 9.817 9.606
30 | 10.270 9.969 9.559 9.291 9.099 8.896
35 | 9.774 9.484 9.088 8.828 8.642 8.444
40 | 9.429 9.147 8.761 8.507 8.325 8.130
50 | 8.982 8.711 8.339 8.092 7.915 7.725
60 | 8.706 8.441 8.077 7.836 7.662 7.474
70 | 8.517 8.257 7.899 7.661 7.489 7.304
80 | 8.381 8.124 7.770 7.535 7.365 7.181
100 | 8.197 7.945 7.597 7.365 7.197 7.014
200 | 7.850 7.607 7.271 7.045 6.881 6.702
∞ | 7.531 7.295 6.970 6.750 6.590 6.414


†Multiply by 10².
*Multiply by 10².




TABLE B.2 (Continued)


1% Significance Level


p=5
n\m 5 6 8 12 15
5 | 20.495* 20.834* 21.267*
6 | 15.014* 15.019* 15.025* 15.033* 15.03*
7 | 2.735* 2.704* 2.665* 2.623* 2.606*
8 | 1.150* 1.128* 1.099* 1.069* 1.057*
10 | 48.048 46.670 44.877 42.992 42.210
12 | 31.108 30.065 28.701 27.257 26.653
14 | 24.016 23.145 22.001 20.781 20.268
16 | 20.240 19.472 18.459 17.373 16.913
18 | 17.929 17.228 16.302 15.304 14.878
20 | 16.380 15.727 14.862 13.925 13.525
25 | 14.107 13.529 12.759 11.918 11.555
30 | 12.880 12.345 11.629 10.842 10.500
35 | 12.115 11.607 10.926 10.174 9.845
40 | 11.593 11.105 10.448 9.720 9.401
50 | 10.928 10.465 9.841 9.144 8.836
60 | 10.523 10.076 9.471 8.794 8.493
70 | 10.251 9.814 9.223 8.559 8.263
80 | 10.055 9.626 9.045 8.390 8.097
100 | 9.793 9.374 8.806 8.164 7.876
200 | 9.306 8.907 8.363 7.745 7.465
∞ | 8.863 8.482 7.961 7.365 7.093


*Multiply by 10².


TABLE B.2 (Continued)


5% Significance Level


p=6
n\m 15 20 25 30 35
10 | 43.103 42.626 42.334 42.136 41.993
12 | 26.843 26.451 26.209 26.044 25.925
14 | 20.489 20.144 19.929 19.783 19.677
16 | 17.202 16.886 16.688 16.553 16.455
18 | 15.218 14.921 14.735 14.607 14.513
20 | 13.899 13.615 13.436 13.313 13.223
25 | 11.975 11.711 11.544 11.428 11.343
30 | 10.939 10.687 10.526 10.414 10.331
35 | 10.293 10.049 9.892 9.782 9.700
40 | 9.853 9.614 9.460 9.351 9.270
50 | 9.293 9.060 8.908 8.801 8.721
60 | 8.951 8.721 8.572 8.465 8.385
70 | 8.720 8.494 8.345 8.239 8.159
80 | 8.555 8.330 8.182 8.076 7.996
100 | 8.333 8.110 7.963 7.857 7.777
200 | 7.919 7.701 7.555 7.449 7.369
500 | 7.689 7.473 7.328 7.222 7.140
1000 | 7.616 7.400 7.255 7.149 7.067
∞ | 7.543 7.328 7.183 7.077 6.994


TABLE B.2 (Continued)


1% Significance Level


p=6
n\m 6 8 10 12 15 20 25 30 35
10 | 86.397 83.565 81.804 80.602 79.376 78.124 77.360 76.845 76.474
12 | 46.027 44.103 42.899 42.073 41.227 40.359 39.826 39.466 39.206
14 | 32.433 30.918 29.966 29.309 28.634 27.936 27.507 27.215 27.004
16 | 25.977 24.689 23.875 23.311 22.729 22.126 21.753 21.498 21.314
18 | 22.292 21.146 20.418 19.913 19.389 18.844 18.505 18.273 18.105
20 | 19.935 18.886 18.217 17.752 17.267 16.761 16.445 16.229 16.071
25 | 16.642 15.737 15.156 14.749 14.324 13.875 13.592 13.397 13.254
30 | 14.944 14.118 13.586 13.211 12.816 12.398 12.133 11.949 11.814
35 | 13.913 13.138 12.635 12.281 11.906 11.506 11.252 11.074 10.943
40 | 13.223 12.482 12.000 11.659 11.298 10.911 10.663 10.490 10.361
50 | 12.358 11.661 11.206 10.882 10.538 10.167 9.927 9.759 9.633
60 | 11.839 11.169 10.730 10.417 10.083 9.721 9.486 9.320 9.196
70 | 11.493 10.841 10.413 10.107 9.779 9.424 9.192 9.028 8.905
80 | 11.246 10.607 10.187 9.886 9.563 9.212 8.983 8.819 8.697
100 | 10.917 10.295 9.886 9.592 9.276 8.930 8.703 8.541 8.419
200 | 10.312 9.723 9.333 9.052 8.748 8.412 8.190 8.030 7.908
500 | 9.980 9.409 9.030 8.755 8.458 8.128 7.907 7.747 7.625
1000 | 9.874 9.308 8.933 8.661 8.365 8.037 7.817 7.657 7.534
∞ | 9.770 9.210 8.838 8.568 8.274 7.948 7.728 7.568 7.446


TABLE B.2 (Continued)


5% Significance Level
p=7
n\m 12 15 20 25 30 35
10 | 83.426 82.755 82.068 81.648 81.364 81.159
12 | 41.627 41.113 40.583 40.257 40.037 39.877
14 | 28.961 28.534 28.091 27.817 27.631 27.495
16 | 23.158 22.781 22.389 22.145 21.978 21.857
18 | 19.893 19.549 19.189 18.964 18.809 18.696
20 | 17.819 17.498 17.159 16.947 16.800 16.694
25 | 14.930 14.642 14.337 14.143 14.009 13.911
30 | 13.440 13.172 12.884 12.701 12.573 12.478
35 | 12.535 12.278 12.002 11.825 11.700 11.608
40 | 11.927 11.679 11.411 11.237 11.115 11.025
50 | 11.165 10.927 10.668 10.500 10.381 10.292
60 | 10.706 10.475 10.221 10.056 9.938 9.850
70 | 10.400 10.173 9.923 9.760 9.643 9.555
80 | 10.181 9.957 9.710 9.548 9.432 9.344
100 | 9.889 9.669 9.426 9.265 9.150 9.062
200 | 9.350 9.138 8.902 8.744 8.629 8.542
500 | 9.054 8.846 8.613 8.456 8.342 8.254
1000 | 8.959 8.753 8.521 8.365 8.250 8.162
∞ | 8.866 8.661 8.431 8.275 8.160 8.072




TABLE B.2 (Continued)


1% Significance Level


p=7
n\m 8 10 12 15 20 25 30 35
10 | 185.93 182.94 180.90 178.83 176.73 175.44 174.57 173.92
12 | 71.731 69.978 68.779 67.552 66.296 65.528 65.010 64.636
14 | 44.255 42.978 42.099 41.197 40.269 39.698 39.311 39.032
16 | 33.097 32.057 31.339 30.599 29.834 29.361 29.039 28.806
18 | 27.273 26.374 25.750 25.105 24.435 24.019 23.735 23.529
20 | 23.757 22.949 22.388 21.804 21.195 20.816 20.556 20.367
25 | 19.117 18.440 17.965 17.469 16.947 16.619 16.392 16.227
30 | 16.848 16.239 15.810 15.360 14.882 14.580 14.370 14.216
35 | 15.512 14.945 14.544 14.121 13.670 13.383 13.183 13.036
40 | 14.634 14.095 13.713 13.309 12.876 12.599 12.405 12.262
50 | 13.553 13.049 12.691 12.310 11.899 11.634 11.448 11.309
60 | 12.914 12.432 12.088 11.720 11.323 11.065 10.882 10.746
70 | 12.492 12.024 11.690 11.332 10.942 10.689 10.509 10.374
80 | 12.193 11.736 11.408 11.056 10.673 10.422 10.244 10.110
100 | 11.797 11.353 11.034 10.691 10.316 10.070 9.894 9.761
200 | 11.077 10.658 10.356 10.028 9.667 9.427 9.254 9.123
500 | 10.685 10.230 9.987 9.668 9.314 9.078 8.906 8.774
1000 | 10.561 10.160 9.869 9.553 9.202 8.966 8.795 8.663
∞ | 10.439 10.043 9.755 9.441 9.092 8.857 8.686 8.555


TABLE B.2 (Continued)
5% Significance Level
p=8
n\m 8 10 12 15 20 25 30 35
41.737 41.198
16 | 31.894 31.242 30.788 30.318 29.829 29.525 29.318 29.167
18 | 26.421 25.847 25.446 25.028 24.591 24.319 24.132 23.996
20 | 23.127 22.605 22.239 21.856 21.454 21.201 21.028 20.902
25 | 18.770 18.324 18.009 17.677 17.325 17.102 16.947 16.834
30 | 16.626 16.221 15.934 15.629 15.303 15.095 14.950 14.843
35 | 15.356 14.977 14.707 14.418 14.109 13.910 13.771 13.668
40 | 14.518 14.156 13.898 13.621 13.322 13.129 12.994 12.893
50 | 13.482 13.142 12.898 12.636 12.351 12.165 12.034 11.936
60 | 12.866 12.540 12.305 12.051 11.774 11.593 11.465 11.368
70 | 12.459 12.142 11.912 11.665 11.393 11.215 11.088 10.992
80 | 12.169 11.858 11.634 11.390 11.122 10.946 10.820 10.725
100 | 11.785 11.483 11.264 11.026 10.763 10.590 10.465 10.370
200 | 11.084 10.798 10.589 10.362 10.108 9.939 9.816 9.722
500 | 10.701 10.423 10.221 9.999 9.751 9.584 9.461 9.367
1000 | 10.579 10.304 10.104 9.884 9.637 9.470 9.348 9.254
∞ | 10.459 10.188 9.989 9.771 9.526 9.360 9.238 9.144




TABLE B.2 (Continued)


1% Significance Level
p=8
n\m 8 10 12 15 20 25 30 35
14 | 65.793 64.035 62.828 61.592 60.323 59.545 59.019 58.639
16 | 44.977 43.633 42.707 41.754 40.771 40.164 39.753 39.456
18 | 35.265 34.146 33.373 32.573 31.745 31.232 30.882 30.629
20 | 29.786 28.808 28.129 27.425 26.691 26.235 25.924 25.697
25 | 23.001 22.212 21.661 21.085 20.480 20.100 19.838 19.647
30 | 19.867 19.173 18.686 18.173 17.631 17.288 17.051 16.876
35 | 18.077 17.440 16.991 16.516 16.011 15.690 15.466 15.301
40 | 16.924 16.324 15.900 15.451 14.970 14.662 14.447 14.288
50 | 15.528 14.975 14.582 14.163 13.711 13.420 13.216 13.063
60 | 14.715 14.190 13.815 13.414 12.980 12.698 12.499 12.351
70 | 14.184 13.677 13.313 12.925 12.502 12.226 12.031 11.885
80 | 13.810 13.315 12.960 12.580 12.165 11.894 11.702 11.556
100 | 13.317 12.839 12.496 12.127 11.722 11.457 11.267 11.124
200 | 12.429 11.983 Lb.66Q 11.311 10.925 10.669 10.484 10.343
500 | 11.951 11.521 11.210 10.871 10.495 10.244 10.061 9.921
1000 | 11.800 11.375 11.067 10.732 10.359 10.109 9.927 9.787
∞ | 11.652 11.233 10.928 10.597 10.227 9.978 9.796 9.656




TABLE B.2 (Continued)
5% Significance Level


p=10
n\m 10 12 15 20 25 30 35
14 | 98.999 98.013 97.002 95.963 95.326 94.9 94.6
16 | 58.554 57.814 57.050 56.260 55.772 55.44 55.20
18 | 43.061 42.454 41.824 41.169 40.762 40.485 40.284
20 | 35.146 34.620 34.071 33.497 33.140 32.895 32.716
25 | 26.080 25.660 25.219 24.753 24.458 24.255 24.107
30 | 22.140 21.773 21.384 20.970 20.706 20.523 20.388
35 | 19.955 19.618 19.260 18.876 18.630 18.458 18.331
40 | 18.569 18.252 17.914 17.550 17.316 17.151 17.029
50 | 16.913 16.622 16.309 15.969 15.748 15.592 15.476
60 | 15.960 15.684 15.385 15.059 14.847 14.695 14.582
70 | 15.341 15.074 14.786 14.469 14.261 14.113 14.002
80 | 14.907 14.647 14.365 14.055 13.851 13.705 13.595
100 | 14.338 14.087 13.814 13.513 13.313 13.170 13.061
200 | 13.319 13.085 12.828 12.542 12.351 12.212 12.106
500 | 12.774 12.548 12.301 12.023 11.836 11.699 11.594
1000 | 12.602 12.379 12.134 11.859 11.674 11.538 11.432
∞ | 12.434 12.214 11.972 11.700 11.515 11.380 11.275




TABLE B.2 (Continued)
1% Significance Level


p=10
n\m 10 12 15 20 25 30
14 | 180.90 178.28 175.62 172.91 171.24 170
16 | 89.068 87.414 85.270 83.980 82.91 82.2
18 | 59.564 58.328 57.055 55.742 54.933 54.384
20 | 45.963 44.951 43.905 42.821 42.150 41.693
25 | 31.774 31.029 30.253 29.440 28.932 28.583
30 | 26.115 25.489 24.832 24.139 23.701 23.399
35 | 23.116 22.556 21.966 21.338 20.939 20.663
40 | 21.267 20.749 20.201 19.615 19.241 18.980
50 | 19.114 18.646 18.148 17.611 17.266 17.023
60 | 17.901 17.462 16.992 16.484 16.154 15.922
70 | 17.124 16.703 16.252 15.762 15.443 15.216
80 | 16.583 16.175 15.738 15.260 14.948 14.726
100 | 15.881 15.490 15.069 14.608 14.305 14.088
200 | 14.641 14.280 13.889 13.457 13.169 12.962
500 | 13.986 13.641 13.266 12.848 12.569 12.366
1000 | 13.780 13.441 13.070 12.658 12.381 12.179
∞ | 13.581 13.246 12.881 12.472 12.198 11.997




TABLE B.3 
TABLES OF SIGNIFICANCE POINTS FOR THE BARTLETT-NANDA-PILLAI TRACE TEST




TABLE B.3 (Continued)




TABLE B.3 (Continued)




TABLE B.3 (Continued) 


p=s 




p=10 




TABLE B4 
TABLES OF SIGNIFICANCE POINTS FOR THE ROY MAXIMUM ROOT TEST




TABLE B.4 (Continued) 




TABLE B.4 (Continued)


2 3 4 5 6 7 8 9 10 15 20
6.470 4.86% 4.005 3.468 3.098 2.827 2620 2.455 2.322 1 908 1.693 
6.634 5.001 4128 3.579 3.199 2.9270 2.705 2.435 2.396 1.965 1.740 
6880 5.216 4.320 3.753 3.359 3.068 2.844 2.665 2.519 2.062 1.819 
7.058 5.372 4.462 3884 3.481 3.182 2.951 2.767 2.615 2.139 1.884 
7.188 5.491 4.57% 3.985 3.576 3.272 3.03% 2.848 2.693 2.203 1.937 
7.544 5.625 4.695 4.101 3.486 3.376 3.136 2.943 2.785 2.280 2.006 
7.495 S774 4,836 4.235 3.813 3.498 3.253 3.057 2.874 2.375 2.090 
7.675 5.944 4.997 4.390 3.963 3.643 2394 3.194 3.028 2.495 2.200 
7.774 6.038 $088 4.477 4.048 3727 3.476 3.274 2107 28 2268 
7.878 6.338 5.188 4.573 4.44) 3.818 3.566 3.363 3.195 2.652 2.348 
7.989 6.246 SL 4,676 4.244 3.920 3.667 3.483 3.204 2.749 2.444 
8.107 6.362 5.4058 4.790 4.357 4.033 3.780 3.576 3.408 2 2.56) 
7.276 = S$.360 4.352 3.730 3306 2.998 2.764 2.580 2.43) 1.974 1.739 
7.570 $574 4.53) 3.885 3444 3.122 2.877 2.683 2.526 2.044 1.795 
7.972) S912 4.817 4.135 3667 3.325 3.063 2.855 2.487 2.165 1.892 
B303 6.164 5.034 4.328 384! 3.484 3.210 2.993 2.816 2.264 1.974 
8.54) 6.360 5.204 4.480 3.980 3.612 3327 3.104 2.921 237 2.08 
8.808 6.583 5.400 4.657 4.142 3.763 3.470 3.237 3.047 2.448 2.128 
9.412 6639 5.628 4.644 4.334 3.943 3.640 32397 3.201 2.576 2.238 
9.458 7.136 5.895 S11 4.565 4.161 3.848 3.598 3.399 2739 2.383 
9.650 7.303 6.047 5.252 4.699 4.289 3.971 3.716 3.507 2.840 2.475 
9.856 7.484 6.213 5.408 4.847 4.431 4.108 3.850 3.637 2.958 2.584 
10.079 7.482 6.395 5.580 S012 4.591 4.264 4.002 3.786 3.097 2.717 
10.319 = 7.897 6.596 5.772 §.198 4.772 4.442 4177 3.959 3.264 2.882 


7.063 5.2 4.304 3.708 3.298 2.999 2.770 2.589 2.442 1.98% 1.753 
7.243 S.a4tt 4.437 3.827 3.406 3.078 2.861 2.674 2.522 2.049 1.802 
7.516 5.647 4.647 4.016 3.580 3.258 3.01) 2.814 2.653 -2.151 1,887 
7.714 §.821 4.803 4.160 3.713 3.382 3.127 2.924 2.757 2.235 1.956 
7.863 5.954 4.925 4.272 3.817 3.481 3.220 3.012 2.842 2.04 2015 
6.030 6.104 5.063 4.401 3.939 3.596 3.330 3.117 2.942 2.388 2.088 
6.216 6.275 5.222 4.55: 4.08: 3.732 3.461 3243 3.063 2.492. 2. 180 
B4260 6 A7t $407 4.727) 4.25) 3.895 3.619 3.396 3.213 2.625 2.300 
B Sal 6.579 5.511 4.827 4.348 3.990 371) 3.486 3.301 2.706 2.376 
. 6.697 5.624 4.937 4.455 4.095 3.814 3.588 3.401 2.800 2.465 
5.747 $.058 4.573 4.211 3.92% 2.702 3.514 2.910 2.573 

5.882 S.t9t 4.705 4.343 4.040 3.833 3.645 3D4t 2.705 

4.640 3960 3.498 3.163 2.908 2707 2.545 2.050 1.797 

4.829 4.124 3.642 3.272 3.026 2816 2.646 2.124 1.855 

$§.$33 4.369 3.879 3.507 3.222 2.997 2.844 2.250 1.956 

; 4065 3.676 3.379 3,143 2951 2355 2.042 

4.214 3.614 3.506 3.262 3.0639 2.443 2.115 

4.390 3.977 3.659 3.406 3199 2.551 2.206 

4.600 4.173 3.843 3.581 3.366 2.688 2.324 

4.855 4.413 4.072 3.799 3.575 2.866 2.48} 

§.003 4.554 4.207 3.929 3.701 2976 2.581 

§.169 4.713 4.360 4.078 3.844 3.106 2.700 

$§.35 4.894 4.535 4.248 4.012 .2260 2.847 

5.567 5.099 4.738 4.446 4,207 3.447 3.030 




TABLE B.4 (Continued) 




p=8 
α  n\m  1 2 3 4 5 6 7 8 9 10 15 20
9 t3.10t 7.640 5.645 4.594 3.941 3.493 3.166 2.916 2.719 2559 2.067 1.812 
21 13.346 7.834 5.808 4.737 4.067 3.607 3.2770 3.012 2808 2.643 2.130 |.863 
25 13.710 8.132 6.0863 4.962 4.270 3.792 3.ddt 3.171 2.956 2.782 2.238 1.952 
P44 13.970 8.350 6.283 5.431 4.425 3.935 3.574 3.295 3.074 2.893 2.326 2.025 
33 14.163) 8.515 6.399 5.264 4.547 4.049 3.680 3.396 3.169 2.984 2.400 2.088 
05 39 14.377 8.70! 6.566 5.416 4.6868 4.181 3.606 3.515 3.283 3.092 2.490 2.165 
49 $4614 8.912) 6.757 5.593 4.854 4.338 3 955 3.658 3.420 3.224 2,603 2.264 
9 14.877 9.15: 6.977 5.800 5.050 4.526 4.136 3.832 3.589 3.388 2.747 2.395 
89 15.021 9.283 7.10: 5.917 5.163 4.634 4.24 3.935 3.489 3.486 2.636 2.478 
129 15.173 9,426 7.235 6.045 5.287 4755 4.358 4.050 3.802 3,597 2.940 2576 
249 15. 9.579 7.38) 6.187 5.424 4.689 4.491 4.160 3.93) 3.725 3,063 2.595 
ies 15.507 9.745 7. S5at 6.342 5.577 5.040 4.640 4.329 4.078 3.872 3.210 2.643 
9 14.999 8.435 619 4.921 4185 3.685 3.323 3.048 2.632 2.4658 2.125 1.853 
21 15463 8.743) 6.357 5.89 4.3955 3.636 3.458 3.171 2.944 2.762 2.201 1.913 
25 ¥6177 «9.228 «6.739 5.439 4.634 4.084 3.682 3.376 3.134 2.938 2332 2.018 
rt 16.700 9.589 7.030 5.687 4653 4.280 3.861 3.541 3.287 3.08: 2442 2.107 
xR {7.100 9.87: 7.259 5.885 5.028 4.439 4.007 3.4676 3.414 3.200 2.535 2.184 
Ot 39 17,549 10.194 7.524 6115 5.234 4.627 4.181 3.839 3.5466 3.344 2.649 2.280 
49 18.058 10.565 7.833 6 387 5.480 4.854 4.372 037 3.754 3.523 2.795 2.405 
69 18.640 10.998 8.199 6.713 5.778 S5.13t 4.653 4.284 3.990 3.749 2.986 2.573 
89 18.962 $t.242 8.408 6 901 5.952 5.294 4808 4.432 4.132 3.886 3.105 2,480 
129 19310 1.508 8.638 7.109 6 146 5.478 4.983 4.60: 4.295 4.044 3.246 2.810 
249 19. 684 14.798 8.891 7.34t 6.364 5.685 5 183 4.794 4.483 4.228 3.415 2.970 
20 20.090 12.317 9.173 7.60: 6.60 5.922 5.413 5.019 4.703 4.445 3.622 3,171 


α  n\m  1 2 3 4 5 6 7 8 9 10 15 20
21 15322 8.761 «6.395 5.158 4.392 3.869 3.489 3.199 2.970 2.785 2217 1.925 
23 15604 8.979 6.577 5.315 4.53: 3.994 3.602 3.303 3.047 2.875 2.285 1.980 
27 16.033 9320 6.864 5.566 4.756 4.199 3.790 3.477 3.229 3.028 2.403 2.075 
3t 16.344 9.573 7,082 5.759 4.93: 4.359 3.939 3.636 3.340 3.15! 2.500 2.156 
KU) 16.580 9.768 «= 7.252 5,912 5.07: 4.488 4.060 3.730 3.467 3.253 2, Bt 2. 225 
0s 4s 16.843 9.992 7.448 6.070 §.234 4.644 4.203 3. 3.596 3.376 2682 2.331 
5t 17.440 10.247 7.676 6.299 5.429 4.824 4.376 4.030 3.754 3.527 2.810 2.423 
7k 17.476 10.543 7.944 6.548 5.663 5.047 4.590 4.235 3.952 3.719 2977 2.572 
9 17.662 10.709 8097 6 69t 5.800 5.378 4.716 4.358 4.070 3.834 3.08! 2.668 
33% 17.861 10.870 8265 6.650 5.952 5.324 4.858 4.496 4. 3.967 3.204 2.783 
25t 18.076 1}. 087 450 7.027 6.122 $490 5.02) 4.656 4.363 4122 3350 2.925 
00 18.307 11.303 8654 7.224 6315 5.679 5.207 4.840 4.546 4.304 3.52% 3.102 
2: 17.197 9.534 6.851 5.470 4.624 4.05! 3.636 3.322 3.075 2.877 2.271 $.962 
23 17.707 9.867) 7.107 5.482 4.806 4.211 3.779 3.452 3.194 2.987 2.35) 2.025 
27 8. 10.399 7.523 6.029 S307 4.478 4.02! 3.672 3.397 3.175 2.49) 2.137 
3t 1% 101 10.805 7.846 6.302 5346 4.693 4.216 3.851 3.564 3.330 2.608 2.233 
& 19.562 33.325 8.103 6.522 5.54t 4.868 4.376 4.000 3.702 3.460 2.707 2315 
bt 4} 20.088 $1.495 8.405 6.782 $.772 5.078 4.570 4.180 3.87! 3.619 2.835 2.420 
* 5t 20.692 41.928 B76} 7.093 6.052 5.335 4.808 4.403 4.083 3.819 2996 2.558 
7\ 23.394 32.441 9.190 7.473 6.397 5.654 5.107 4.686 4.350 4.076 3.21) 2.745 
4 21.790 12.735 9.439 7.695 6.601 5.845 5.287 4.857 4.515 4.234 3.347 2.867 
13} 22.221 13.059 9.716 7.944 6.832 6.062 5.494 5.055 4.705 4.418 3.530 3.036 
25t 22.692 13.417 10.025 8.225 7.094 6.310 5.732 5.285 4.928 4.636 3.707 3.201 
2 23.209 «13.836 10.373 8.545 7.395 6.598 6.0!0 5.556 5.193 4.895 3.952 3.438 




TABLE B.5 


SIGNIFICANCE POINTS FOR THE MODIFIED LIKELIHOOD RATIO TEST OF 
EQUALITY OF COVARIANCE MATRICES BASED ON EQUAL SAMPLE SIZES 
Pr{-2 log λ* ≥ x} = 0.05


n_g\q |     2      3      4      5      6      7      8      9     10
p=2
   3  | 12.18  18.70  24.55  30.09  35.45  40.68  45.81  50.87  55.87
   4  | 10.70  16.65  22.00  27.07  31.97  36.76  41.45  46.07  50.64
   5  |  9.97  15.63  20.73  25.56  30.23  34.79  39.26  43.67  48.02
   6  |  9.53  15.02  19.97  24.66  29.19  33.61  37.95  42.22  46.45
   7  |  9.24  14.62  19.46  24.05  28.49  32.82  37.07  41.26  45.40
   8  |  9.04  14.33  19.10  23.62  27.99  32.26  36.45  40.57  44.65
   9  |  8.88  14.11  18.83  23.30  27.62  31.84  35.98  40.06  44.08
  10  |  8.76  13.94  18.61  23.05  27.33  31.51  35.61  36.65  43.64
p=3
   5  |  19.2   30.5   41.0   51.0   60.7   70.3   79.7   89.0   98.3
   6  | 17.57  28.24  38.06  47.49  56.68  65.69  74.58  83.37  92.09
   7  | 16.59  26.84  36.29  45.37  54.21  62.89  71.45  79.91  88.29
   8  | 15.93  25.90  35.10  43.93  52.54  60.99  69.33  77.56  85.72
   9  | 15.46  25.22  34.24  42.90  51.34  59.62  67.79  75.86  83.86
  10  | 15.11  24.71  33.59  42.11  50.42  58.58  66.62  74.57  82.45
  11  | 14.83  24.31  33.08  41.50  49.71  57.76  65.71  73.56  81.35
  12  | 14.61  23.99  32.67  41.01  49.13  57.11  64.97  72.75  80.46
  13  | 14.43  23.73  32.33  40.60  48.66  56.57  64.37  72.08  79.72
p=4
   6  | 30.07  48.63  65.91   82.6   98.9  115.0  131.0      -      -
   7  | 27.31  44.69  60.90  76.56  91.89  107.0  121.9  137.0  152.0
   8  | 25.61  42.24  57.77  72.78  87.46  101.9  116.2  130.4  144.6
   9  | 24.46  40.56  55.62  70.17  84.42  98.45  112.3  126.1  139.8
  10  | 23.62  39.34  54.05  68.27  82.19  95.91  109.5  122.9  136.3
  11  | 22.98  38.41  52.85  66.81  80.49  93.95  107.3  120.5  133.6
  12  | 22.48  37.67  51.90  65.66  79.14  92.41  105.5  118.5  131.5
  13  | 22.08  37.08  51.13  64.73  78.04  91.16  104.1  117.0  129.7
  14  | 21.75  36.59  50.50  63.96  77.14  90.12  103.0  115.7  128.3
  15  | 21.47  36.17  49.97  63.31  76.38  89.25  102.0         127.1




n_g\q |     2      3      4      5      6      7
p=5
   8  | 39.29  65.15  89.46  113.0      -      -
   9  | 36.70  61.40  84.63  107.2  129.3  151.5
  10  | 34.92  58.79  81.25  103.1  124.5  145.7
  11  | 33.62  56.86  78.76  100.0  120.9  141.6
  12  | 32.62  55.37  76.83  97.68  118.2  138.4
  13  | 31.83  54.19  75.30  95.81  116.0  135.9
  14  | 31.19  53.24  74.06  94.29  114.2  133.8
  15  | 30.66  52.44  73.02  93.03  112.7  132.1
  16  | 30.21  51.77  72.14  91.95  111.4  130.6

n_g\q |     2      3      4
p=6
  10  | 49.95  84.43  117.0
  11  | 47.43  80.69  112.2
  12  | 45.56  77.90  108.6
  13  | 44.11  75.74  105.7
  14  | 42.96  74.01  103.5
  15  | 42.03  72.59  101.6
  16  | 41.25  71.41  100.1
  17  | 40.59  70.41  98.75
  18  | 40.02  69.55  97.63
  19  | 39.53  68.80  96.64
  20  | 39.11  68.14  95.78




TABLE B.6 


CORRECTION FACTORS FOR SIGNIFICANCE POINTS FOR THE SPHERICITY TEST


5% Significance Level 


1.000 
1.000 
1.000 
1.000


11.0705


1.322
1.122
1.066
1.041
1.029
1.021


1.013 
1.008 
1.006 
1.005
1.004


1.002
1.002


1.001
1.001
1.000
1.000


16.9190


1.383
1.155
1.088
1.057
1.040


1.023 
1.015
1.011
1.008
1.006


1.004
1.003


1.002
1.001
1.001
1.000 


23.6848 


1.420
1.180
1.098
1.071


1.039
1.024
1.017
1.012
1.010


1.006
1.004


1.003
1.002
1.001
1.000 


31.4104 


1.442
1.199
1.121


1.060
1.037
1.025
1.018
1.014


1.009
1.006


1.004
1.002
1.002
1.000


40.1133 


1.455 
1.214


1.093 
1.054 
1.035 
1.025 
1.019 


1.012 
1.008 


1.005 
1.003 
1.002
1.000 


49.8018 




1.001 
1.000 
1.000
1.000 


15.0863 


TABLE B.6 (Continued)


1.396
1.148
1.079
1.049
1.034
1.025


1.015
1.010
1.007
1.005
1.004


1.003
1.002


1.001
1.001
1.001
1.000


21.6660


1% Significance Level 


1.471 
1.186 
1.103
1.067 
1.047 


1.027 
1.018 
1.012 
1.009 
1.007 


1.005 
1.003 


1.002
1.001
1.001
1.000


29.1412


1.511
1.213
1.123
1.081


1.044
1.028
1.019 
1.014 
1.011 


1.007 
1.005 


1.003 
1.002 
1.001
1.000 


37.5662 


1.542 
1.234 
1.138 


1.068 
1.041 
1.028 
1.020 
1.015 


1.010 
1.007 


1.004
1.003 
1.002 
1.000 


46.9629 


1.556 
1.250


1.104
1.060
1.039
1.028
1.021


1.013
1.009


1.006
1.003 
1.002 
1.001 


57.3421 


TABLE B.7* 
SIGNIFICANCE POINTS FOR THE MODIFIED LIKELIHOOD RATIO TEST Σ = Σ₀
Pr{-2 log λ₁* ≥ x} = 0.05


p= 
32.5 40.0 
31.4 38.6 


30.55 37.51 
29.92 36.72 
29.42 36.09
29.02 35.57
28.68 35.15


28.40 34.79 
28.15 34.49 


27.94 34.23
27.76 34.00 
27.60 33.79 


p=7 p=9 p= 10 
18 48.6  56.9  | 28 70.1  79.6  | 34 (82.3) (92.4)
19 48.2  56.3  | 30 69.4  78.8  | 36 81.7   91.8
20 47.7  55.8  |                | 38 81.2   91.2
21 47.34 55.36 | 32 68.8  78.17 | 40 80.7   90.7
22 47.00 54.96 | 34 68.34 77.60 |


36 (67.91) (77.08) 
38 (67.53) (76.65) 
40 67.21 76.29 


45 79.83 89.63 
50 79.13 88.83 
55 78.57 88.20 
60 78.13 87.68
65 77.75 87.26 


45 66.54 75.51 
50 66.02 74.92
55 65.61 74.44 
65.28 74.06 


70 77.44 86.89 
75 77.18 86,59 


†Entries in parentheses have been interpolated or extrapolated into Korin’s table.
p = number of variates; N = number of observations; n = N - 1. -2 log λ₁* = n log|Σ₀| -
np - n log|S| + n tr(SΣ₀⁻¹), where S is the sample covariance matrix.
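
As a rough illustration of the footnote expression, the following Python sketch evaluates n log|Σ₀| - np - n log|S| + n tr(SΣ₀⁻¹) from a data matrix, so the result can be compared with the tabled 5% points. The function name `modified_lr_statistic` and the use of NumPy are assumptions made for this example and are not part of the original text; S is taken to be the unbiased sample covariance matrix with divisor n = N - 1.

```python
import numpy as np


def modified_lr_statistic(X, sigma0):
    """Evaluate n*log|Sigma0| - n*p - n*log|S| + n*tr(S Sigma0^{-1}).

    X is an N x p data matrix (rows are observations) and sigma0 is the
    hypothesized p x p covariance matrix. The name and signature are
    hypothetical, chosen only for this illustration.
    """
    X = np.asarray(X, dtype=float)
    sigma0 = np.atleast_2d(np.asarray(sigma0, dtype=float))
    N, p = X.shape
    n = N - 1

    # Sample covariance matrix with divisor n = N - 1.
    S = np.atleast_2d(np.cov(X, rowvar=False))

    # Log-determinants via slogdet to avoid overflow or underflow.
    sign_s, logdet_s = np.linalg.slogdet(S)
    sign_0, logdet_0 = np.linalg.slogdet(sigma0)
    if sign_s <= 0 or sign_0 <= 0:
        raise ValueError("S and sigma0 must both be positive definite")

    # tr(S Sigma0^{-1}) = tr(Sigma0^{-1} S), computed without an explicit inverse.
    trace_term = np.trace(np.linalg.solve(sigma0, S))

    return n * (logdet_0 - p - logdet_s + trace_term)
```

The hypothesis Σ = Σ₀ would be rejected at the 5% level when the value returned exceeds the tabled point for the given p and n. The statistic is zero when S equals Σ₀ exactly and positive otherwise.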




References 


At the end of each reference in brackets is a list of sections in which that 
reference is used. 


Abramowitz, Milton, and Irene Stegun (1972), Handbook of Mathematical Functions
with Formulas, Graphs, and Mathematical Tables, National Bureau of Standards,
U.S. Government Printing Office, Washington, D.C. [8.5]


Abruzzi, Adam (1950), Experimental Procedures and Criteria for Estimating and Evaluat- 
ing Industrial Productivity, doctoral dissertation, Columbia University Library. [9.7, 
9.P] 


Adrian, Robert (1808), Research concerning the probabilities of the errors which 
happen in making observations, etc., The Analyst or Mathematical Museum, 1, 
93-109. [1.2]


Ahn, S. K., and G. C. Reinsel (1988), Nested reduced-rank autoregressive models for 
multiple time series, Journal of the American Statistical Association, 83, 849-856.
[12.7] 


Aitken, A. C. (1937), Studies in practical mathematics, II. The evaluation of the latent 
roots and latent vectors of a matrix, Proceedings of the Royal Society of Edinburgh, 
57, 269-305. [11.4]

Amemiya, Yasuo, and T. W. Anderson (1990), Asymptotic chi-square tests for a large 
class of factor analysis models, Annals of Statistics, 18, 1453-1463. [14.6] 

Anderson, R. L., and T. A. Bancroft (1952), Statistical Theory in Research, McGraw- 
Hill, New York. [8.P] 

Anderson, T. W. (1946a), The non-central Wishart distribution and certain problems
of multivariate statistics, Annals of Mathematical Statistics, 17, 409-431. (Correc- 
tion, 35 (1964), 923-924.) [14.4] 


Anderson, T. W. (1946b), Analysis of multivariate variance, unpublished. [10.6] 


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson 
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc. 




Anderson, T. W. (1950), Estimation of the parameters of a single equation by the
limited-information maximum-likelihood method, Statistical Inference in Dynamic
Economic Models (Tjalling C. Koopmans, ed.), John Wiley & Sons, Inc., New
York. [12.8]


Anderson, T. W. (1951a), Classification by multivariate analysis, Psychometrika, 16,
31-50. [6.5, 6.9] 


Anderson, T. W. (1951b), Estimating linear restrictions on regression coefficients for 
multivariate normal distributions, Annals of Mathematical Statistics, 22, 327-351.
(Correction, Annals of Statistics, 8 (1980), 1400.) [12.6, 12.7] 


Anderson, T. W. (1951c), The asymptotic distribution of certain characteristic roots 
and vectors, Proceedings of the Second Berkeley Symposium on Mathematical 
Statistics and Probability (Jerzy Neyman, ed.), University of California, Berkeley, 
105-130. [12.7] 


Anderson, T. W. (1955a), Some statistical problems in relating experimental data to 
predicting performance of a production process, Journal of the American Statisti- 
cal Association, 50, 163-177. [8.P] 


Anderson, T. W. (1955b), The integral of a symmetric unimodal function over a 
symmetric convex set and some probability inequalities, Proceedings of the Ameri- 
can Mathematical Society, 6, 170-176. [8.10] 


Anderson, T. W. (1957), Maximum likelihood estimates for a multivariate normal 
distribution when some observations are missing, Journal of the American Statisti-
cal Association, 52, 676-687. [4.P]


Anderson, T. W. (1963a), Asymptotic theory for principal component analysis, Annals 
of Mathematical Statistics, 34, 122-148. [11.3, 11.6, 11.7, 13.5] 


Anderson, T. W. (1963b), A test for equality of means when covariance matrices are 
unequal, Annals of Mathematical Statistics, 34, 671-672. [5.P] 


Anderson, T. W. (1965a), Some optimum confidence bounds for roots of determinan-
tal equations, Annals of Mathematical Statistics, 36, 468-488. [11.6]


Anderson, T. W. (1965b), Some properties of confidence regions and tests of parame- 
ters in multivariate distributions (with discussion), Proceedings of the IBM Scien- 
tific Computing Symposium in Statistics, October 21-23, 1963, IBM Data Process- 
ing Division, White Plains, New York, 15-28. [11.6] 

Anderson, T. W. (1969), Statistical inference for covariance matrices with linear
structure, Multivariate Analysis II (P. R. Krishnaiah, ed.), Academic, New York,
55-66. [3.P]


Anderson, T. W. (1971), The Statistical Analysis of Time Series, John Wiley & Sons, 
Inc., New York. [8.11] 

Anderson, T. W. (1973a), Asymptotically efficient estimation of covariance matrices
with linear structure, Annals of Statistics, 1, 135-141. [14.3] 

Anderson, T. W. (1973b), An asymptotic expansion of the distribution of the Studen- 
tized classification statistic W, Annals of Statistics, 1, 964-972. [6.6] 


Anderson, T. W. (1973c), Asymptotic evaluation of the probabilities of misclassifica- 
tion by linear discriminant functions, Discriminant Analysis and Applications 
(T. Cacoullos, ed.), Academic, New York, 17-35. [6.6] 




Anderson, T. W. (1974), An asymptotic expansion of the distribution of the limited 
information maximum likelihood estimate of a coefficient in a simultaneous 
equation system, Journal of the American Statistical Association, 69, 565-573. 
(Correction, 71 (1976), 1010.) [12.7] 


Anderson, T. W. (1976), Estimation of linear functional relationships: approximate 
distributions and connections with simultaneous equations in econometrics (with 
discussion), Journal of the Royal Statistical Society B, 38, 1-36. [12.7]


Anderson, T. W. (1977), Asymptotic expansions of the distributions of estimates in
simultaneous equations for alternative parameter sequences, Econometrica, 45,
509-518. [12.7] 


Anderson, T. W. (1984a), Estimating linear statistical relationships, Annals of Statis-
tics, 12, 1-45. [10.6, 12.6, 12.7, 14.1]


Anderson, T. W. (1984b), Asymptotic distribution of an estimator of linear functional 
relationships, unpublished. [12.7] 


Anderson, T. W. (1987), Multivariate linear relations, Proceedings of the Second
International Tampere Conference in Statistics (Tarmo Pukkila and Simo Puntanen,
eds.), Tampere, Finland, 9-36. [12.7] 


Anderson, T. W. (1989a), The asymptotic distribution of the likelihood ratio criterion 
for testing rank in multivariate components of variance, Journal of Multivariate
Analysis, 30, 72-79. [10.6, 12.7] 


Anderson, T. W. (1989b), The asymptotic distribution of characteristic roots and
vectors in multivariate components of variance, Contributions to Probability and
Statistics: Essays in Honor of Ingram Olkin (Leon Jay Gleser, Michael D. Perlman,
S. James Press, and Allan R. Sampson, eds.), Springer-Verlag, New York,
177-196. [13.6]


Anderson, T. W. (1989c), Linear latent variable models and covariance structures,
Journal of Econometrics, 41, 91-119. [Correction, 43 (1990), 395.] [12.8]

Anderson, T. W. (1993), Nonnormal multivariate distributions: Inference based on 
elliptically contoured distributions, Multivariate Analysis: Future Directions (C. R.
Rao, ed.), North-Holland, Amsterdam, 1-25. [3.6] 


Anderson, T. W. (1994), Inference in linear models, Proceedings of the International
Symposium on Multivariate Analysis and its Applications (T. W. Anderson, K. T.
Fang, and I. Olkin, eds.), Institute of Mathematical Statistics, 1-20. [12.7, 12.8, 
14.3] 


Anderson, T. W. (1999a), Asymptotic theory for canonical correlation analysis, Jour-
nal of Multivariate Analysis, 70, 1-29. [13.7]

Anderson, T. W. (1999b), Asymptotic distribution of the reduced rank regression
estimator under general conditions, Annals of Statistics, 27, 1141-1154. [13.7]

Anderson, T. W. (2002), Specification and misspecification in reduced rank regres-
sion, Sankhya, 64A, 1-13. [12.7]


Anderson, T. W., and Yasuo Amemiya (1988a), The asymptotic normal distribution of
estimators in factor analysis under general conditions, Annals of Statistics, 16,
759-771. [14.6]




Anderson, T. W., and Yasuo Amemiya (1988b), Asymptotic distributions in factor 
analysis and linear structural relations, Proceedings of the International Conference 
on Advances in Multivariate Statistical Analysis (S. Das Gupta and J. K. Ghosh, 
eds.), Indian Statistical Institute, Calcutta, 1-22. [14.6]


Anderson, T. W., and R. R. Bahadur (1962), Classification into two multivariate 
normal distributions with different covariance matrices, Annals of Mathematical
Statistics, 33, 420-431. [6.10]


Anderson, T. W., and Kai-Tai Fang (1990a), On the theory of multivariate elliptically
contoured distributions and their applications, Statistical Inference in Elliptically 
Contoured and Related Distributions (Kai-Tai Fang and T. W. Anderson, eds.), 
Allerton Press, Inc., New York, 1-23. [4.5] 


Anderson, T. W., and Kai-Tai Fang (1990b), Inference in multivariate elliptically
contoured distributions based on maximum likelihood, Statistical Inference in
Elliptically Contoured and Related Distributions (Kai-Tai Fang and T. W. Ander-
son, eds.), Allerton Press, Inc., New York, 201-216. [3.6]


Anderson, T. W., and Kai-Tai Fang (1992), Theory and applications of elliptically
contoured and related distributions, The Development of Statistics: Recent Contri- 
butions from China (X. R. Chen, K. T. Fang, and C. C. Yang, eds.), Longman 
Scientific and Technical, Harlow, Essex, 41-62. [10.11] 


Anderson, T. W., Kai-Tai Fang, and Huang Hsu (1986), Maximum likelihood esti- 
mates and likelihood-ratio criteria for multivariate elliptically contoured distribu- 
tions, Canadian Journal of Statistics, 14, 55-59. [Reprinted in Statistical Inference 
in Elliptically Contoured and Related Distributions (Kai-Tai Fang and T. W. 
Anderson, eds.), Allerton Press, Inc., New York, 1990, 217-223.] [3.6]


Anderson, T. W., Somesh Das Gupta, and George P. H. Styan (1972), A Bibliography
of Multivariate Statistical Analysis, Oliver & Boyd, Edinburgh. (Reprinted by 
Robert E. Krieger, Malabar, Florida, 1977.) [Preface] 


Anderson, T. W., Naoto Kunitomo, and Takamitsu Sawa (1983a), Evaluation of the 
distribution function of the limited information maximum likelihood estimator, 
Econometrica, 50, 1009-1027. [12.7] 


Anderson, T. W., Naoto Kunitomo, and Takamitsu Sawa (1983b), Comparison of the 
densities of the TSLS and LIMLK estimators, Global Econometrics, Essays in 
Honor of Lawrence R. Klein (F. Gerard Adams and Bert Hickman, eds.), MIT, 
Cambridge, MA, 103-124. [12.7]


Anderson, T. W., and I. Olkin (1985), Maximum likelihood estimation of the parame-
ters of a multivariate normal distribution, Linear Algebra and Its Applications, 70, 
147-171. [3.2] 


Anderson, T. W., and Michael D. Perlman (1987), Consistency of invariant tests for 
the multivariate analysis of variance, Proceedings of the Second International
Tampere Conference in Statistics (Tarmo Pukkila and Simo Puntanen, eds.),
Tampere, Finland, 225-243. [8.10] 


Anderson, T. W., and Michael D. Perlman (1993), Parameter consistency of invariant 
tests for MANOVA and related multivariate hypotheses, Statistics and Probability: 
A Raghu Raj Bahadur Festschrift (J. K. Ghosh, S. K. Mitra, K. R. Parthasarathy,
and B. L. S. Prakasa Rao, eds.), Wiley Eastern Limited, 37-62. [8.10]




Anderson, T. W., and Herman Rubin (1949), Estimation of the parameters of a single 
equation in a complete system of stochastic equations, Annals of Mathematical 
Statistics, 20, 46-63. [Reprinted in Readings in Econometric Theory (J. Malcolm 
Dowling and Fred R. Glahe, eds.), Colorado Associated University, 1970, 
358-375.] [12.8] 

Anderson, T. W., and Herman Rubin (1950), The asymptotic properties of estimates 
of the parameters of a single equation in a complete system of stochastic 
equations, Annals of Mathematical Statistics, 21, 570-582. [Reprinted in Readings 
in Econometric Theory (J. Malcolm Dowling and Fred R. Glahe, eds.), Colorado 
Associated University, 1970, 376-388.] [12.8]

Anderson, T. W., and Herman Rubin (1956), Statistical inference in factor analysis, 
Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Proba-
bility (Jerzy Neyman, ed.), Vol. V, University of California, Berkeley and Los
Angeles, 111-150. [14.2, 14.3, 14.4, 14.6]

Anderson, T. W., and Takamitsu Sawa (1973), Distributions of estimates of coeffi- 
cients of a single equation in a simultaneous system and their asymptotic 
expansions, Econometrica, 41, 683-714. [12.7]

Anderson, T. W., and Takamitsu Sawa (1977), Tables of the distribution of the 
maximum likelihood estimate of the slope coefficient and approximations, Tech- 
nical Report No. 234, Economics Series, Institute for Mathematical Studies in 
the Social Sciences, Stanford University, April. [12.7] 

Anderson, T. W., and Takamitsu Sawa (1979), Evaluation of the distribution function 
of the two-stage least squares estimate, Econometrica, 47, 163-182. [12.7] 

Anderson, T. W., and Takamitsu Sawa (1982), Exact and approximate distributions of 
the maximum likelihood estimator of a slope coefficient, Journal of the Royal 
Statistical Society B, 44, 52-62. [12.7] 

Anderson, T. W., and George P. H. Styan (1982), Cochran’s theorem, rank additivity 
and tripotent matrices, Statistics and Probability: Essays in Honor of C. R. Rao 
(G. Kallianpur, P. R. Krishnaiah, and J. K. Ghosh, eds.), North-Holland, Amster- 
dam, 1-23. [7.4] 

Anderson, T. W., and Akimichi Takemura (1982), A new proof of admissibility of 
tests in multivariate analysis, Journal of Multivariate Analysis, 12, 457-468. [8.10] 

Andersson, Steen A., David Madigan, and Michael D. Perlman (2001), Alternative 
Markov properties for chain graphs, Scandinavian Journal of Statistics, 28, 33-85. 
[15.2] 

Barnard, M. M. (1935), The secular variations of skull characters in four series of
Egyptian skulls, Annals of Eugenics, 6, 352-371. [8.8] 

Barndorff-Nielsen, O. E. (1978), Information and Exponential Families in Statistical
Theory, John Wiley & Sons, New York. [15.5] 

Barnes, E. W. (1899), The theory of the gamma function, Messenger of Mathematics, 
29, 64-129. [8.5] 

Bartlett, M. S. (1934), The vector representation of a sample, Proceedings of the 
Cambridge Philosophical Society, 30, 327-340. [8.3] 

Bartlett, M. S. (1937a), Properties of sufficiency and statistical tests, Proceedings of the 
Royal Society of London A, 160, 268-282. [10.2] 




Bartlett, M. S. (1937b), The statistical conception of mental factors, British Journal of 
Psychology, 28, 97-104. [14.7] 

Bartlett, M. S. (1938), Further aspects of the theory of multiple regression, Proceed- 
ings of the Cambridge Philosophical Society, 34, 33-40. [14.7] 

Bartlett, M. S. (1939), A note on tests of significance in multivariate analysis, 
Proceedings of the Cambridge Philosophical Society, 35, 180-185. [7.2, 8.6] 

Bartlett, M. S. (1947), Multivariate analysis, Journal of the Royal Statistical Society, 
Supplement, 9, 176-197. [8.8] 

Bartlett, M. S. (1950), Tests of significance in factor analysis, British Journal of 
Psychology (Statistics Section), 3, 77-85. [14.3] 

Basmann, R. L. (1957), A generalized classical method of linear estimation of 
coefficients in a structural equation, Econometrica, 25, 77-83. [12.8] 


Basmann, R. L. (1961), A note on the exact finite sample frequency functions of 
generalized classical linear estimators in two leading overidentified cases, Journal 
of the American Statistical Association, 56, 619-636. [12.8] 

Basmann, R. L. (1963), A note on the exact finite sample frequency functions of 
generalized classical linear estimators in a leading three-equation case, Journal of 
the American Statistical Association, 58, 161-171. [12.8] 

Bennett, B. M. (1951), Note on a solution of the generalized Behrens—Fisher 
problem, Annals of the Institute of Statistical Mathematics, 2, 87-90. [5.5] 

Berger, J. O. (1975), Minimax estimation of location vectors for a wide class of 
densities, Annals of Statistics, 3, 1318-1328. [3.5] 


Berger, J. O. (1976), Admissibility results for generalized Bayes estimators of coordi- 
nates of a location vector, Annals of Statistics, 4, 334-356. [3.5] 


Berger, J. O. (1980a), A robust generalized Bayes estimator and confidence region for 
a multivariate normal mean, Annals of Statistics, 8, 716-761. [3.5] 


Berger, J. O. (1980b), Statistical Decision Theory, Foundations, Concepts and Methods, 
Springer-Verlag, New York. [3.4, 6.2, 6.7] 


Berndt, Ernst R., and Eugene Savin (1977), Conflict among criteria for testing
hypotheses in the multivariate linear regression model, Econometrica, 45, 
1263-1277. [8.6] 


Bhattacharya, P. K. (1966), Estimating the mean of a multivariate normal population 
with general quadratic loss function, Annals of Mathematical Statistics, 37, 
1819-1824. [3.5] 

Birnbaum, Allan (1955), Characterizations of complete classes of tests of some
multiparametric hypotheses, with applications to likelihood ratio tests, Annals of 
Mathematical Statistics, 26, 21-36. [5.6, 8.10] 


Björck, A., and G. Golub (1973), Numerical methods for computing angles between
linear subspaces, Mathematics of Computation, 27, 579-594. [12.3]

Blackwell, David, and M. A. Girshick (1954), Theory of Games and Statistical Deci-
sions, John Wiley & Sons, New York. [6.2, 6.7]

Blalock, H. M., Jr. (ed.) (1971), Causal Models in the Social Sciences, Aldine-Atherton,
Chicago. [15.1] 


Bonnesen, T., and W. Fenchel (1948), Theorie der Konvexen Körper, Chelsea, New
York. [8.10] 




Bose, R. C. (1936a), On the exact distribution and moment-coefficients of the 
D²-statistic, Sankhya, 2, 143-154. [3.3]

Bose, R. C. (1936b), A note on the distribution of differences in mean values of two 
samples drawn from two multivariate normally distributed populations, and the 
definition of the D²-statistic, Sankhya, 2, 379-384. [3.3]

Bose, R. C., and S. N. Roy (1938), The distribution of the studentised D²-statistic,
Sankhya, 4, 19-38. [5.4] 

Bowker, A. H. (1960), A representation of Hotelling’s T² and Anderson's classifica-
tion statistic W in terms of simple statistics, Contributions to Probability and
Statistics (I. Olkin, S. G. Ghurye, W. Hoeffding, W. G. Madow, and H. B. Mann,
eds.), Stanford University, Stanford, California, 142-149. [5.2]

Bowker, A. H., and R. Sitgreaves (1961), An asymptotic expansion for the distribution
function of the W-classification statistic, Studies in Item Analysis and Prediction
(Herbert Solomon, ed.), Stanford University, Stanford, California, 293-310. [6.6] 

Box, G. E. P. (1949), A general distribution theory for a class of likelihood criteria, 
Biometrika, 36, 317-346. [8.5, 9.4, 10.4, 10.5] 

Bravais, Auguste (1846), Analyse mathématique sur les probabilités des erreurs de 
situation d’un point, Mémoires Présentés par Divers Savants à l’Académie Royale
des Sciences de l’Institut de France, 9, 255-332. [1.2]

Brown, G. W. (1939), On the power of the L₁ test for equality of several variances,
Annals of Mathematical Statistics, 10, 119-128. [10.2] 

Browne, M. W. (1974), Generalized least squares estimates in the analysis of covari- 
ance structures, South African Statistical Journal, 8, 1-24. [14.3] 

Chambers, John M. (1977), Computational Methods for Data Analysis, John Wiley &
Sons, New York. [12.3] 

Chan, Tony F., Gene H. Golub, and Randall J. LeVeque (1981), Algorithms for 
computing the sample variance: analysis and recommendations, unpublished. 
[3.2] 

Chernoff, Herman (1956), Large sample theory: parametric case, Annals of Mathemat- 
ical Statistics, 27, 1-22. [8.6] 

Chernoff, Herman, and N. Divinsky (1953), The computation of maximum likelihood 
estimates of linear structural equations, Studies in Econometric Method (W. C. 
Hood and T. C. Koopmans, eds.), John Wiley & Sons, Inc., New York. [12.8]

Chu, S. Sylvia, and K. C. S. Pillai (1979), Power comparisons of two-sided tests of 
equality of two covariance matrices based on six criteria, Annals of the Institute of
Statistical Mathematics, 31, 185-205. [10.6]

Chung, Kai Lai (1974), A Course in Probability Theory, 2nd ed., Academic, New York.
[2.2] 

Clemm, D. S., P. R. Krishnaiah, and V. B. Waikar (1973), Tables for the extreme
roots of the Wishart matrix, Journal of Statistical Computation and Simulation, 2,
65-92. [10.8] 

Clunies-Ross, C. W., and R. H. Riffenburgh (1960), Geometry and linear discrimina- 
tion, Biometrika, 47, 185-189. [6.10] 

Cochran, W. G. (1934), The distribution of quadratic forms in a normal system, with
applications to the analysis of covariance, Proceedings of the Cambridge Philosoph-
ical Society, 30, 178-191. [7.4]




Constantine, A. G. (1963), Some non-central distribution problems in multivariate 
analysis, Annals of Mathematical Statistics, 34, 1270-1285. [12.4]


Constantine, A. G. (1966), The distribution of Hotelling’s generalised T₀², Annals of
Mathematical Statistics, 37, 215-225. [8.6] 


Consul, P. C. (1966), On the exact distributions of the likelihood ratio criteria for
testing linear hypotheses about regression coefficients, Annals of Mathematical 
Statistics, 37, 1319-1330. [8.4] 


Consul, P. C. (1967a), On the exact distributions of likelihood ratio criteria for testing
independence of sets of variates under the null hypothesis, Annals of Mathemati-
cal Statistics, 38, 1160-1169. [9.3]


Consul, P. C. (1967b), On the exact distributions of the criterion W for testing 
sphericity and in a p-variate normal distribution, Annals of Mathematical Statis- 
tics, 38, 1170-1174. [10.7]


Courant, R., and D. Hilbert (1953), Methods of Mathematical Physics, Interscience, 
New York. [9.10] 


Cox, D. R., and N. Wermuth (1996), Multivariate Dependencies, Chapman and Hall,
London. [15.1] 

Cramér, H. (1946), Mathematical Methods of Statistics, Princeton University, Prince-
ton. [2.6, 3.2, 3.4, 6.5, 7.2]

Dahm, P. Fred, and Wayne A. Fuller (1981), Generalized least squares estimation of
the functional multivariate linear errors in variables model, unpublished. [14.3] 

Daly, J. F. (1940), On the unbiased character of likelihood-ratio tests for indepen-
dence in normal systems, Annals of Mathematical Statistics, 11, 1-32. [9.2]


Darlington, R. B., S. L. Weinberg, and H. J. Walberg (1973), Canonical variate
analysis and related techniques, Review of Educational Research, 43, 433-454.
[12.2]


Das Gupta, Somesh (1965), Optimum classification rules for classification into two
multivariate normal populations, Annals of Mathematical Statistics, 36, 1174-1184. 
[6.6] 


Das Gupta, S., T. W. Anderson, and G. S. Mudholkar (1964), Monotonicity of the
power functions of some tests of the multivariate linear hypothesis, Annals of
Mathematical Statistics, 35, 200-205. [8.10]


David, F. N. (1937), A note on unbiased limits for the correlation coefficient, 
Biometrika, 29, 157-160. [4.2]


David, F. N. (1938), Tables of the Ordinates and Probability Integral of the Distribution of
the Correlation Coefficient in Small Samples, Cambridge University, Cambridge. 
[4.2] 

Davis, A. W. (1968), A system of linear differential equations for the distribution of 
Hotelling’s generalized T_0^2, Annals of Mathematical Statistics, 39, 815-832. [8.6]

Davis, A. W. (1970a), Exact distributions of Hotelling’s generalized T_0^2, Biometrika,
57, 187-191. [Preface, 8.6]

Davis, A. W. (1970b), Further applications of a differential equation for Hotelling’s
generalized T_0^2, Annals of the Institute of Statistical Mathematics, 22, 77-87.
[Preface, 8.6]


Davis, A. W. (1971), Percentile approximations for a class of likelihood ratio criteria,
Biometrika, 58, 349-356. [10.8, 10.9]


Davis, A. W. (1972a), On the marginal distributions of the latent roots of the 
multivariate beta matrix, Annals of Mathematical Statistics, 43, 1664-1670. [8.6] 

Davis, A. W. (1972b), On the distributions of the latent roots and traces of certain 
random matrices, Journal of Multivariate Analysis, 2, 189-200. [8.6] 

Davis, A. W. (1980), Further tabulation of Hotelling’s generalized T_0^2,
Communications in Statistics, B9, 321-336. [Preface, 8.6]

Davis, A. W., and J. B. F. Field (1971), Tables of some multivariate test criteria,
Technical Report No. 32, Division of Mathematical Statistics, C.S.I.R.O.,
Canberra, Australia. [10.8]


Davis, Harold T. (1933), Tables of the Higher Mathematical Functions, Vol. I, Principia 
Press, Bloomington, Indiana. [8.5] 

Davis, Harold T. (1935), Tables of the Higher Mathematical Functions, Vol. II, Principia
Press, Bloomington, Indiana. [8.5] 

Deemer, Walter L., and Ingram Olkin (1951), The Jacobians of certain matrix
transformations useful in multivariate analysis. Based on lectures of P. L. Hsu at
the University of North Carolina, 1947, Biometrika, 38, 345-367. [13.2]


De Groot, Morris H. (1970), Optimal Statistical Decisions, McGraw-Hill, New York. 
[3.4, 6.2, 6.7] 

Dempster, A. P. (1972), Covariance selection, Biometrics, 28, 157-175. [15.5] 

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977), Maximum likelihood from 
incomplete data via the EM algorithm (with discussion), Journal of the Royal 
Statistical Society B, 39, 1-38. [14.3]

Diaconis, Persi, and Bradley Efron (1983), Computer-intensive methods in statistics,
Scientific American, 248, 116-130. [4.3] 

Eaton, M. L., and M. D. Perlman (1974), A monotonicity property of the power
functions of some invariant tests for MANOVA, Annals of Statistics, 2, 1022-1028.
[8.10]


Edwards, D. (1995), Introduction to Graphical Modelling, Springer-Verlag, New York. 
[15.1] 


Efron, Bradley (1982), The Jackknife, the Bootstrap, and Other Resampling Plans,
Society for Industrial and Applied Mathematics, Philadelphia. [4.2] 

Efron, Bradley, and Carl Morris (1977), Stein’s paradox in statistics, Scientific
American, 236, 119-127. [3.5]

Elfving, G. (1947), A simple method of deducing certain distributions connected with 
multivariate sampling, Skandinavisk Aktuarietidskrift, 30, 56-74. [7.2] 

Fang, Kai-Tai, Samuel Kotz, and Kai-Wang Ng (1990), Symmetric Multivariate and 
Related Distributions, Chapman and Hall, New York. [2.7] 

Fang, Kai-Tai, and Yao-Ting Zhang (1990), Generalized Multivariate Analysis,
Springer-Verlag, New York. [2.7, 3.6, 10.11, 7.9] 

Ferguson, Thomas Shelburne (1967), Mathematical Statistics: A Decision Theoretic 
Approach, Academic, New York. [3.4, 6.2, 6.7] 

Fisher, R. A. (1915), Frequency distribution of the values of the correlation
coefficient in samples from an indefinitely large population, Biometrika, 10, 507-521.
[4.2, 7.2]


Fisher, R. A. (1921), On the “probable error” of a coefficient of correlation deduced 
from a small sample, Metron, 1, Part 4, 3-32. [4.2] 


Fisher, R. A. (1924), The distribution of the partial correlation coefficient, Metron, 3, 
329-332. [4.3] 

Fisher, R. A. (1928), The general sampling distribution of the multiple correlation 
coefficient, Proceedings of the Royal Society of London, A, 121, 654-673. [4.4] 


Fisher, R. A. (1936), The use of multiple measurements in taxonomic problems, 
Annals of Eugenics, 7, 179-188. [5.3, 6.5, 11.5] 


Fisher, R. A. (1939), The sampling distribution of some statistics obtained from 
non-linear equations, Annals of Eugenics, 9, 238-249. [13.2] 


Fisher, R. A. (1947a), The Design of Experiments (4th ed.), Oliver and Boyd,
Edinburgh. [8.9]


Fisher, R. A. (1947b), The analysis of covariance method for the relation between a 
part and the whole, Biometrics, 3, 65-68. [3.P]

Fisher, R. A., and F. Yates (1942), Statistical Tables for Biological, Agricultural and
Medical Research (2nd ed.), Oliver and Boyd, Edinburgh. [4.2] 

Fog, David (1948), The geometrical method in the theory of sampling, Biometrika, 35, 
46-54. [7.2] 

Foster, F. G. (1957), Upper percentage points of the generalized beta distribution, II,
Biometrika, 44, 441-453. [8.6] 

Foster, F. G. (1958), Upper percentage points of the generalized beta distribution, III, 
Biometrika, 45, 492-503. [8.6] 

Foster, F. G., and D. H. Rees (1957), Upper percentage points of the generalized beta 
distribution, I, Biometrika, 44, 237-247. [8.6] 

Frets, G. P. (1921), Heredity of head form in man, Genetica, 3, 193-384. [3.P] 


Frisch, R. (1929), Correlation and scatter in statistical variables, Nordic Statistical 
Journal, 8, 36-102. [7.5] 


Frydenberg, M. (1990), The chain graph Markov property, Scandinavian Journal of 
Statistics, 17, 333-353. [15.2] 


Fujikoshi, Y. (1973), Monotonicity of the power functions of some tests in general
MANOVA models, Annals of Statistics, 1, 388-391. [8.6] 


Fujikoshi, Y. (1974), The likelihood ratio tests for the dimensionality of regression 
coefficients, Journal of Multivariate Analysis, 4, 327-340. [12.4] 


Fujikoshi, Y., and M. Kanazawa (1976), The ML classification statistic in covariate 
discriminant analysis and its asymptotic expansions, Essays in Probability and 
Statistics, 305-320. [6.6] 


Fuller, Wayne A., Sastry G. Pantula, and Yasuo Amemiya (1982), The covariance
matrix of estimators for the factor model, unpublished. [14.4] 


Gabriel, K. R. (1969), Simultaneous test procedures—some theory of multiple com- 
parisons, Annals of Mathematical Statistics, 40, 224-250. [8.7] 


Gajjar, A. V. (1967), Limiting distributions of certain transformations of multiple 
correlation coefficients, Metron, 26, 189-193. [4.4] 


Galton, Francis (1889), Natural Inheritance, MacMillan, London. [1.2, 2.5]

Gauss, K. F. (1823), Theory of the Combination of Observations, Göttingen. [1.2]

Giri, N. (1977), Multivariate Statistical Inference, Academic, New York. [7.2, 7.7]


Giri, N., and J. Kiefer (1964), Local and asymptotic minimax properties of multivari- 
ate tests, Annals of Mathematical Statistics, 35, 21-35. [5.6] 


Giri, N., J. Kiefer, and C. Stein (1963), Minimax character of Hotelling’s T^2 test in
the simplest case, Annals of Mathematical Statistics, 34, 1524-1535. [5.6]

Girshick, M. A. (1939), On the sampling theory of roots of determinantal equations,
Annals of Mathematical Statistics, 10, 203-224. [13.2] 

Gleser, Leon Jay (1981), Estimation in a multivariate “errors in variables” regression 
model: large sample results, Annals of Statistics, 9, 24-44. [12.7]

Glynn, W. J., and R. J. Muirhead (1978), Inference in canonical correlation analysis,
Journal of Multivariate Analysis, 8, 468-478. [12.4]

Golub, Gene H., and Franklin T. Luk (1976), Singular value decomposition: applica- 
tions and computations, unpublished. [12.3] 

Golub, Gene H., and Charles F. Van Loan (1989), Matrix Computations (2nd ed.),
Johns Hopkins University Press, Baltimore. [11.4, 12.3, A.5] 


Grubbs, F. E. (1954), Tables of 1% and 5% probability levels of Hotelling’s
generalized T^2 statistics, Technical Note No. 926, Ballistic Research Laboratory,
Aberdeen Proving Ground, Maryland. [8.6]

Gupta, Shanti S. (1963), Bibliography on the multivariate normal integrals and related 
topics, Annals of Mathematical Statistics, 34, 829-838. [2.3] 

Gurland, John (1968), A relatively simple form of the distribution of the multiple 
correlation coefficient, Journal of the Royal Statistical Society B, 30, 276-283. [4.4]

Gurland, J., and R. Milton (1970), Further consideration of the distribution of the
multiple correlation coefficient, Journal of the Royal Statistical Society B, 32,
381-394. [4.4]

Haavelmo, T. (1944), The probability approach in econometrics, Econometrica, 12,
Supplement, 1-118. [12.7]

Haff, L. R. (1980), Empirical Bayes estimation of the multivariate normal covariance 
matrix, Annals of Statistics, 8, 586-597. [7.8]

Halmos, P. R. (1950), Measure Theory, D. van Nostrand, New York. [4.5, 13.3]

Harris, Bernard, and Andrew P. Soms (1980), The use of the tetrachoric series for
evaluating multivariate normal probabilities, Journal of Multivariate Analysis, 10,
252-267. [2.3]

Hayakawa, Takesi (1967), On the distribution of the maximum latent root of a
positive definite symmetric random matrix, Annals of the Institute of Statistical 
Mathematics, 19, 1-17. [8.6] 

Heck, D. L. (1960), Charts of some upper percentage points of the distribution of the 
largest characteristic root, Annals of Mathematical Statistics, 31, 625-642. [8.6] 

Hickman, W. Braddock (1953), The Volume of Corporate Bond Financing Since 1900, 
Princeton University, Princeton, 82-90. [10.7] 

Hoel, Paul G. (1937), A significance test for component analysis, Annals of
Mathematical Statistics, 8, 149-158. [7.5]

Hooker, R. H. (1907), The correlation of the weather and crops, Journal of the Royal 
Statistical Society, 70, 1-42. [4.2]

Hotelling, Harold (1931), The generalization of Student's ratio, Annals of
Mathematical Statistics, 2, 360-378. [5.1, 5.P]

Hotelling, Harold (1933), Analysis of a complex of statistical variables into principal 
components, Journal of Educational Psychology, 24, 417-441, 498-520. [11.1, 14.3]


Hotelling, Harold (1935), The most predictable criterion, Journal of Educational
Psychology, 26, 139-142. [12.2] 

Hotelling, Harold (1936), Relations between two sets of variates, Biometrika, 28, 
321-377. [12.1] 


Hotelling, Harold (1947), Multivariate quality control, illustrated by the air testing of
sample bombsights, Techniques of Statistical Analysis (C. Eisenhart, M. Hastay,
and W. A. Wallis, eds.), McGraw-Hill, New York, 111-184. [8.6]


Hotelling, Harold (1951), A generalized T test and measure of multivariate
dispersion, Proceedings of the Second Berkeley Symposium on Mathematical Statistics
and Probability (Jerzy Neyman, ed.), University of California, Los Angeles and
Berkeley, 23-41. [8.6, 10.7]


Hotelling, Harold (1953), New light on the correlation coefficient and its transforms
(with discussion), Journal of the Royal Statistical Society B, 15, 193-232. [4.2]


Howe, W. G. (1955), Some Contributions to Factor Analysis, U.S. Atomic Energy
Commission Report, Oak Ridge National Laboratory, Oak Ridge, Tennessee.
[14.2, 14.6] 


Hsu, P. L. (1938), Notes on Hotelling’s generalized T, Annals of Mathematical 
Statistics, 9, 231-243. [5.4]


Hsu, P. L. (1939a), On the distribution of the roots of certain determinantal
equations, Annals of Eugenics, 9, 250-258. [13.2]


Hsu, P. L. (1939b), A new proof of the joint product moment distribution, Proceedings
of the Cambridge Philosophical Society, 35, 336-338. [7.2] 


Hsu, P. L. (1945), On the power functions for the E^2-test and the T^2-test, Annals of
Mathematical Statistics, 16, 278-286. [5.6]


Hudson, M. (1974), Empirical Bayes estimation, Technical Report No. 58, NSF
contract GP 30711X-2, Department of Statistics, Stanford University. [3.5]


Immer, F. R., H. D. Hayes, and LeRoy Powers (1934), Statistical determination of
barley varietal adaptation, Journal of the American Society of Agronomy, 26, 
403-407. [8.9] 


Ingham, A. E. (1933), An integral which occurs in statistics, Proceedings of the 
Cambridge Philosophical Society, 29, 271-276. [7.2] 


Ito, K. (1956), Asymptotic formulae for the distribution of Hotelling’s generalized T_0^2
statistic, Annals of Mathematical Statistics, 27, 1091-1105. [8.6]

Ito, K. (1960), Asymptotic formulae for the distribution of Hotelling’s generalized T_0^2
statistic, II, Annals of Mathematical Statistics, 31, 1148-1153. [8.6]

Izenman, A. J. (1975), Reduced-rank regression for the multivariate linear model,
Journal of Multivariate Analysis, 5, 248-264. [12.7] 


Izenman, Alan Julian (1980), Assessing dimensionality in multivariate regression,
Analysis of Variance, Handbook of Statistics, Vol. 1 (P. R. Krishnaiah, ed.),
North-Holland, Amsterdam, 571-591. [3.P]


James, A. T. (1954), Normal multivariate analysis and the orthogonal group, Annals of 
Mathematical Statistics, 25, 40-75. [7.2] 


James, A. T. (1964), Distributions of matrix variates and latent roots derived from 
normal samples, Annals of Mathematical Statistics, 35, 475-501. [8.6]


James, W., and C. Stein (1961), Estimation with quadratic loss, Proceedings of the 
Fourth Berkeley Symposium on Mathematical Statistics and Probability (Jerzy 
Neyman, ed.), Vol. I, 361-379, University of California, Berkeley. [3.5, 7.8] 

Japanese Standards Association (1972), Statistical Tables and Formulas with Computer 
Applications. [Preface] 

Jennrich, Robert I., and Dorothy T. Thayer (1973), A note on Lawley’s formulas for
standard errors in maximum likelihood factor analysis, Psychometrika, 38, 
571-580. [14.3] 

Johansen, S. (1995), Likelihood-based Inference in Cointegrated Vector Autoregressive
Models, Oxford University Press. [12.7] 

Jolicoeur, Pierre, and J. E. Mosimann (1960), Size and shape variation in the painted 
turtle, a principal component analysis, Growth, 24, 339-354. Also in Benchmark 
Papers in Systematic and Evolutionary Biology (E. H. Bryant and W. R. Atchley, 
eds.), 2 (1975), 86-101. [11.P] 

Jöreskog, K. G. (1969), A general approach to confirmatory maximum likelihood
factor analysis, Psychometrika, 34, 183-202. [14.2] 

Jöreskog, K. G., and Arthur S. Goldberger (1972), Factor analysis by generalized least
squares, Psychometrika, 37, 243-260. [14.3] 

Kaiser, Henry F. (1958), The varimax criterion for analytic rotation in factor analysis, 
Psychometrika, 23, 187-200. [14.5] 

Kanazawa, M. (1979), The asymptotic cut-off point and comparison of error probabili- 
ties in covariate discriminant analysis, Journal of the Japan Statistical Society, 9, 
7-17. [6.6] 


Kelley, T. L. (1928), Crossroads in the Mind of Man, Stanford University, Stanford. 
[4.P, 9.P] 

Kendall, M. G., and Alan Stuart (1973), The Advanced Theory of Statistics (3rd ed.), 
Vol. 2, Charles Griffin, London. [12.6]

Kennedy, William J., Jr., and James E. Gentle (1980), Statistical Computing, Marcel 
Dekker, New York. [12.3] 

Khatri, C. G. (1963), Joint estimation of the parameters of multivariate normal 
populations, Journal of Indian Statistical Association, 1, 125-133. [7.2] 

Khatri, C. G. (1966), A note on a large sample distribution of a transformed multiple 
correlation coefficient, Annals of the Institute of Statistical Mathematics, 18, 
375-380. [4.4]

Khatri, C. G. (1972), On the exact finite series distribution of the smallest or the
largest root of matrices in three situations, Journal of Multivariate Analysis, 2, 
201-207. [8.6] 

Khatri, C. G., and K. C. Sreedharan Pillai (1966), On the moments of the trace of a 
matrix and approximations to its non-central distribution, Annals of Mathematical 
Statistics, 37, 1312-1318. [8.6] 

Khatri, C. G., and K. C. S. Pillai (1968), On the non-central distributions of two test 
criteria in multivariate analysis of variance, Annals of Mathematical Statistics, 39, 
215-226. [8.6] 

Khatri, C. G., and K. V. Ramachandran (1958), Certain multivariate distribution
problems, I (Wishart’s distribution), Journal of the Maharaja Sayajirao University
of Baroda, 7, 79-82. [7.2]


Kiefer, J. (1957), Invariance, minimax sequential estimation, and continuous time 
processes, Annals of Mathematical Statistics, 28, 573-601. [7.8] 

Kiefer, J. (1966), Multivariate optimality results, Multivariate Analysis (Paruchuri R.
Krishnaiah, ed.), Academic, New York, 255-274. [7.8] 

Kiefer, J., and R. Schwartz (1965), Admissible Bayes character of T^2-, R^2-, and other
fully invariant tests for classical multivariate normal problems, Annals of
Mathematical Statistics, 36, 747-770. [5.P, 9.9, 10.10]

Klotz, Jerome, and Joseph Putter (1969), Maximum likelihood estimation of the 
multivariate covariance components for the balanced one-way layout, Annals of
Mathematical Statistics, 40, 1100-1105. [10.6] 


Kolmogorov, A. (1950), Foundations of the Theory of Probability, Chelsea, New York. 
[2.2] 


Konishi, Sadanori (1978a), An approximation to the distribution of the sample 
correlation coefficient, Biometrika, 65, 654-656. [4.2] 


Konishi, Sadanori (1978b), Asymptotic expansions for the distributions of statistics 
based on a correlation matrix, Canadian Journal of Statistics, 6, 49-56. [4.2] 


Konishi, Sadanori (1979), Asymptotic expansions for the distributions of functions of a 
correlation matrix, Journal of Multivariate Analysis, 9, 259-266. [4.2] 

Koopmans, T. C., and Olav Reiersøl (1950), The identification of structural
characteristics, Annals of Mathematical Statistics, 21, 165-181. [14.2]

Korin, B. P. (1968), On the distribution of a statistic used for testing a covariance 
matrix, Biometrika, 55, 171-178. [10.8] 


Korin, B. P. (1969), On testing the equality of k covariance matrices, Biometrika, 56, 
216-218. [10.5] 


Kramer, K. H. (1963), Tables for constructing confidence limits on the multiple 
correlation coefficient, Journal of the American Statistical Association, 58, 
1082-1085. [4.4] 


Krishnaiah, P. R. (1978), Some recent developments on real multivariate distributions, 
Development in Statistics (P. R. Krishnaiah, ed.), Vol. 1, Academic, New York, 
135-169. [8.6] 


Krishnaiah, P. R. (1980), Computations of some multivariate distributions, Analysis of 
Variance, Handbook of Statistics, Vol. 1 (P. R. Krishnaiah, ed.), North-Holland, 
Amsterdam, 745-971. 


Krishnaiah, P. R., and T. C. Chang (1972), On the exact distributions of the traces of
S_1(S_1 + S_2)^{-1} and S_1 S_2^{-1}, Sankhya, A, 34, 153-160. [8.6]

Krishnaiah, P. R., and F. J. Schuurmann (1974), On the evaluation of some distribu- 
tions that arise in simultaneous tests for the equality of the latent roots of the 
covariance matrix, Journal of Multivariate Analysis, 4, 265-282. [10.7]

Kshirsagar, A. M. (1959), Bartlett decomposition and Wishart distribution, Annals of 
Mathematical Statistics, 30, 239-241. [7.2] 

Kudo, H. (1955), On minimax invariant estimates of the transformation parameter,
Natural Science Report, 6, 31-73, Ochanomizu University, Tokyo, Japan. [7.8] 

Kunitomo, Naoto (1980), Asymptotic expansions of the distributions of estimators in a
linear functional relationship and simultaneous equations, Journal of the American
Statistical Association, 75, 693-700. [12.7]


Lachenbruch, P. A., and M. R. Mickey (1968), Estimation of error rates in
discriminant analysis, Technometrics, 10, 1-11. [6.6]

Laplace, P. S. (1811), Mémoire sur les intégrales définies et leur application aux 
probabilités, Mémoires de l'Institut Impérial de France, Année 1810, 279-347. [1.2]

Lauritzen, Steffen L. (1996), Graphical Models, Clarendon Press, Oxford. [15.5] 

Lauritzen, Steffen L., and N. Wermuth (1989), Graphical models for associations 
between variables some of which are qualitative and some quantitative, Annals of 
Statistics, 17, 31-57. [15.2] 

Läuter, Jürgen, Ekkehard Glimm, and Siegfried Kropf (1996a), New multivariate tests
for data with an inherent structure, Biometrics Journal, 38, 5-23. (Correction: 40
(1998), 1015.) [5.7]

Läuter, Jürgen, Ekkehard Glimm, and Siegfried Kropf (1996b), Multivariate tests
based on left-spherically distributed linear scores, Annals of Statistics, 26,
1972-1988. [5.7]

Läuter, Jürgen, Ekkehard Glimm, and Siegfried Kropf (1996c), Exact stable
multivariate tests for applications in clinical research, ASA Proceedings of the
Biopharmaceutical Section, 46-55. [5.7]

Lawley, D. N. (1938), A generalization of Fisher's z test, Biometrika, 30, 180-187. 
[8.6] 

Lawley, D. N. (1940), The estimation of factor loadings by the method of maximum
likelihood, Proceedings of the Royal Society of Edinburgh, Sec. A, 60, 64-82. [14.3] 

Lawley, D. N. (1941), Further investigations in factor estimation, Proceedings of the 
Royal Society of Edinburgh, Sec. A, 61, 176-185. [14.4]

Lawley, D. N. (1953), A modified method of estimation in factor analysis and some 
large sample results, Uppsala Symposium on Psychological Factor Analysis, 17-19
March 1953, Uppsala, Almqvist and Wiksell, 35-42. [14.3] 

Lawley, D. N. (1958), Estimation in factor analysis under various initial assumptions,
British Journal of Statistical Psychology, 11, 1-12. [14.2, 14.6] 

Lawley, D. N. (1959), Tests of significance in canonical analysis, Biometrika, 46,
59-66. [12.4]

Lawley, D. N. (1967), Some new results in maximum likelihood factor analysis,
Proceedings of the Royal Society of Edinburgh, Sec. A, 87, 256-264. [14.3]

Lawley, D. N., and A. E. Maxwell (1971), Factor Analysis as a Statistical Method (2nd 
ed.), American Elsevier, New York. [14.3, 14.5] 

Lee, Y. S. (1971a), Asymptotic formulae for the distribution of a multivariate test
statistic: power comparisons of certain multivariate tests, Biometrika, 58, 647-651.
[8.6]

Lee, Y. S. (1971b), Some results on the sampling distribution of the multiple 
correlation coefficient, Journal of the Royal Statistical Society B, 33, 117-130. [4.4]

Lee, Y. S. (1972), Tables of upper percentage points of the multiple correlation 
coefficient, Biometrika, 59, 175-189. [4.4] 

Lehmann, E. L. (1959), Testing Statistical Hypotheses, John Wiley & Sons, New York. 
[4.2, 5.6]

Lehmer, Emma (1944), Inverse tables of probabilities of errors of the second kind, 
Annals of Mathematical Statistics, 15, 388-398. [5.4]


Loève, M. (1977), Probability Theory I (4th ed.), Springer-Verlag, New York. [2.2]

Loève, M. (1978), Probability Theory II (4th ed.), Springer-Verlag, New York. [2.2]


Madow, W. G. (1938), Contributions to the theory of multivariate statistical analysis,
Transactions of the American Mathematical Society, 44, 454-495. [7.2] 


Magnus, Jan R. (1988), Linear Structures, Charles Griffin and Co., London. [A.4] 


Magnus, J. R., and H. Neudecker (1979), The commutation matrix: some properties
and applications, The Annals of Statistics, 7, 381-394. [3.6]


Mahalanobis, P. C. (1930), On tests and measures of group divergence, Journal and 
Proceedings of the Asiatic Society of Bengal, 26, 541-588. [3.3]


Mahalanobis, P. C., R. C. Bose, and S. N. Roy (1937), Normalisation of statistical 
variates and the use of rectangular co-ordinates in the theory of sampling 
distributions, Sankhya, 3, 1-40. [7.2]


Mallows, C. L. (1961), Latent vectors of random symmetric matrices, Biometrika, 48, 
133-149. [11.6]


Mardia, K. V. (1970), Measures of multivariate skewness and kurtosis with
applications, Biometrika, 57, 519-530. [3.6]


Mariano, Roberto S., and Takamitsu Sawa (1972), The exact finite-sample
distribution of the limited-information maximum likelihood estimator in the case of two
included endogenous variables, Journal of the American Statistical Association, 67,
159-163. [12.7]


Maronna, R. A. (1976), Robust M-estimators of multivariate location and scatter, 
Annals of Statistics, 4, 51-67. [3.6] 


Marshall, A. W., and I. Olkin (1979), Inequalities: Theory of Majorization and Its
Applications, Academic, New York. [8.10] 


Mathai, A. M. (1971), On the distribution of the likelihood ratio criterion for testing 
linear hypotheses on regression coefficients, Annals of the Institute of Statistical 
Mathematics, 23, 181-197. [8.4] 


Mathai, A. M., and R. S. Katiyar (1979), Exact percentage points for testing indepen- 
dence, Biometrika, 66, 353-356. [9.3] 


Mathai, A. M., and P. N. Rathie (1980), The exact non-null distribution for testing 
equality of covariance matrices, Sankhya A, 42, 78-87. [10.4] 


Mathai, A. M., and R. K. Saxena (1973), Generalized Hypergeometric Functions with
Applications in Statistics and Physical Sciences, Lecture Notes No. 348,
Springer-Verlag, New York. [9.3]


Mauchly, J. W. (1940), Significance test for sphericity of a normal n-variate
distribution, Annals of Mathematical Statistics, 11, 204-209. [10.7]


Mauldon, J. G. (1955), Pivotal quantities for Wishart’s and related distributions, and a
paradox in fiducial theory, Journal of the Royal Statistical Society B, 17, 79-85.
[7.2]

McDonald, Roderick P. (2002), What can we learn from path equations:
identifiability, constraints, equivalence, Psychometrika, 67, 225-249. [15.1]

McLachlan, G. J. (1973), An asymptotic expansion of the expectation of the estimated
error rate in discriminant analysis, Australian Journal of Statistics, 15, 210-214.
[6.6]


McLachlan, G. J. (1974a), An asymptotic unbiased technique for estimating the error
rates in discriminant analysis, Biometrics, 30, 239-249. [6.6]

McLachlan, G. J. (1974b), Estimation of the errors of misclassification on the
criterion of asymptotic mean square error, Technometrics, 16, 255-260. [6.6]

McLachlan, G. J. (1974c), The asymptotic distributions of the conditional error rate
and risk in discriminant analysis, Biometrika, 61, 131-135. [6.6]

McLachlan, G. J. (1977), Constrained sample discrimination with the studentized
classification statistic W, Communications in Statistics-Theory and Methods, A6,
575-583. [6.6]

Memon, A. Z., and M. Okamoto (1971), Asymptotic expansion of the distribution of
the Z statistic in discriminant analysis, Journal of Multivariate Analysis, 1,
294-307. [6.4] 

Mijares, T. A. (1964), Percentage Points of the Sum V_1^{(s)} of s Roots (s = 1-50), The
Statistical Center, University of the Philippines, Manila. [8.6]

Mikhail, N. N. (1965), A comparison of tests of the Wilks-Lawley hypothesis in 
multivariate analysis, Biometrika, 52, 149-156. [8.6]

Mood, A. M. (1951), On the distribution of the characteristic roots of normal 
second-moment matrices, Annals of Mathematical Statistics, 22, 266-273. [13.2] 

Morris, Blair, and Ingram Olkin (1964), Some estimation and testing problems for 
factor analysis models, unpublished. [10.6] 

Mudholkar, G. S. (1966), On confidence bounds associated with multivariate analysis 
of variance and non-independence between two sets of variates, Annals of 
Mathematical Statistics, 37, 1736-1746. [8.7] 

Mudholkar, Govind S., and Madhusudan C. Trivedi (1980), A normal approximation
for the distribution of the likelihood ratio statistic in multivariate analysis of
variance, Biometrika, 67, 485-488. [8.5]

Mudholkar, Govind S., and Madhusudan C. Trivedi (1981), A normal approximation
for the multivariate likelihood ratio statistics, Statistical Distributions in Scientific 
Work (C. Taillie et al., eds.), Vol. 5, 219-230, D. Reidel Publishing. [8.5] 

Muirhead, R. J. (1970), Asymptotic distributions of some multivariate tests, Annals of 
Mathematical Statistics, 41, 1002-1010. [8.6] 

Muirhead, R. J. (1980), The effects of elliptical distributions on some standard
procedures involving correlation coefficients: a review, Multivariate Statistical
Analysis (R. P. Gupta, ed.), 143-159. [4.5]

Muirhead, Robb J. (1982), Aspects of Multivariate Statistical Theory, John Wiley and 
Sons, New York. [2.7, 3.6, 7.7]

Muirhead, R. J., and C. M. Waternaux (1980), Asymptotic distributions in canonical 
correlation analysis and other multivariate procedures for nonnormal popula- 
tions, Biometrika, 67, 31-43. [4.5] 

Nagao, Hisao (1973a), On some test criteria for covariance matrix, Annals of Statistics, 
1, 700-709. [9.5, 10.2, 10.7, 10.8] 

Nagao, Hisao (1973b), Asymptotic expansions of the distributions of Bartlett's test and 
sphericity test under the local alternatives, Annals of the Institute of Statistical 
Mathematics, 25, 407-422. [10.5, 10.6]

Nagao, Hisao (1973c), Nonnull distributions of two test criteria for independence
under local alternatives, Journal of Multivariate Analysis, 3, 435-444. [9.4] 


Nagarsenker, B. N., and K. C. S. Pillai (1972), The Distribution of the Sphericity Test 
Criterion, ARL 72-0154, Aerospace Research Laboratories. [Preface]

Nagarsenker, B. N., and K. C. S. Pillai (1973a), The distribution of the sphericity test 
criterion, Journal of Multivariate Analysis, 3, 226-235. [10.7]

Nagarsenker, B. N., and K. C. S. Pillai (1973b), Distribution of the likelihood ratio
criterion for testing a hypothesis specifying a covariance matrix, Biometrika, 60, 
359-364. [10.8] 

Nagarsenker, B. N., and K. C. S. Pillai (1974), Distribution of the likelihood ratio
criterion for testing Σ = Σ_0, μ = μ_0, Journal of Multivariate Analysis, 4, 114-122.
[10.9] 

Nanda, D. N. (1948), Distribution of a root of a determinantal equation, Annals of
Mathematical Statistics, 19, 47-57. [8.6] 

Nanda, D. N. (1950), Distribution of the sum of roots of a determinantal equation 
under a certain condition, Annals of Mathematical Statistics, 21, 432-439. [8.6]

Nanda, D. N. (1951), Probability distribution tables of the larger root of a
determinantal equation with two roots, Journal of the Indian Society of Agricultural
Statistics, 3, 175-177. [8.6]

Narain, R. D. (1948), A new approach to sampling distributions of the multivariate
normal theory, I, Journal of the Indian Society of Agricultural Statistics, 1, 59-69.
[7.2]

Narain, R. D. (1950), On the completely unbiased character of tests of independence
in multivariate normal systems, Annals of Mathematical Statistics, 21, 293-298.
[9.2]

National Bureau of Standards, United States (1959), Tables of the Bivariate Normal 
Distribution Function and Related Functions, U.S. Government Printing Office,
Washington, D.C. [2.3] 

Neveu, Jacques (1965), Mathematical Foundations of the Calculus of Probability, 
Holden-Day, San Francisco. [2.2] 

Ogawa, J. (1953), On the sampling distributions of classical statistics in multivariate 
analysis, Osaka Mathematics Journal, 5, 13-52. [7.2] 

Okamoto, Masashi (1963), An asymptotic expansion for the distribution of the linear
discriminant function, Annals of Mathematical Statistics, 34, 1286-1301.
(Correction, 39 (1968), 1358-1359.) [6.6]

Okamoto, Masashi (1973), Distinctness of the eigenvalues of a quadratic form in a 
multivariate sample, Annals of Statistics, 1, 763-765. [13.2]

Olkin, Ingram, and S. N. Roy (1954), On multivariate distribution theory, Annals of
Mathematical Statistics, 25, 329-339. [7.2]

Olson, C. L. (1974), Comparative robustness of six tests in multivariate analysis of 
variance, Journal of the American Statistical Association, 69, 894-908. [8.6]

Pearl, Judea (2000), Causality: Models, Reasoning, and Inference, Cambridge Univer- 
sity Press, Cambridge. [15.1] 

Pearson, E. S., and H. O. Hartley (1972), Biometrika Tables for Statisticians, Vol. II,
Cambridge (England), Published for the Biometrika Trustees at the University
Press. [Preface, 8.4]

Pearson, E. S., and S. S. Wilks (1933), Methods of statistical analysis appropriate for k
samples of two variables, Biometrika, 25, 353-378. [10.5] 


Pearson, K. (1896), Mathematical contributions to the theory of evolution. III.
Regression, heredity and panmixia, Philosophical Transactions of the Royal Society
of London, Series A, 187, 253-318. [2.5, 3.2]

Pearson, K. (1900), On the criterion that a given system of deviations from the
probable in the case of a correlated system of variables is such that it can be
reasonably supposed to have arisen from random sampling, Philosophical
Magazine, 50 (fifth series), 157-175. [3.3]

Pearson, K. (1901), On lines and planes of closest fit to systems of points in space, 
Philosophical Magazine, 2 (sixth series), 559-572. [11.2] 

Pearson, K. (1930), Tables for Statisticians and Biometricians, Part I (3rd ed.), 
Cambridge University, Cambridge. [2.3] 

Pearson, K. (1931), Tables for Statisticians and Biometricians, Part II, Cambridge 
University, Cambridge. [2.3, 6.8] 

Perlman, M. D. (1980), Unbiasedness of the likelihood ratio tests for equality of 
several covariance matrices and equality of several multivariate normal 
populations, Annals of Statistics, 8, 247-263. [10.2, 10.3] 

Perlman, M. D., and I. Olkin (1980), Unbiasedness of invariant tests for MANOVA 
and other multivariate problems, Annals of Statistics, 8, 1326-1341. [8.10] 

Phillips, P. C. B. (1980), The exact distribution of instrumental variable estimators in 
an equation containing n + 1 endogenous variables, Econometrica, 48, 861-878. 
[12.7] 

Phillips, P. C. B. (1982), A new approach to small sample theory, unpublished, Cowles 
Foundation for Research in Economics, Yale University. [12.7] 

Pillai, K. C. S. (1954), On some distribution problems in multivariate analysis, Mimeo 
Series No. 54, Institute of Statistics, University of North Carolina, Chapel Hill, 
North Carolina. [8.6] 

Pillai, K. C. S. (1955), Some new test criteria in multivariate analysis, Annals of 
Mathematical Statistics, 26, 117-121. [8.6] 

Pillai, K. C. S. (1956), On the distribution of the largest or the smallest root of a 
matrix in multivariate analysis, Biometrika, 43, 122-127. [8.6] 

Pillai, K. C. S. (1960), Statistical Tables for Tests of Multivariate Hypotheses, Statistical 
Center, University of the Philippines, Manila. [8.6] 

Pillai, K. C. S. (1964), On the moments of elementary symmetric functions of the roots 
of two matrices, Annals of Mathematical Statistics, 35, 1704-1712. [8.6] 

Pillai, K. C. S. (1965), On the distribution of the largest characteristic root of a matrix 
in multivariate analysis, Biometrika, 52, 405-412. [8.6] 

Pillai, K. C. S. (1967), Upper percentage points of the largest root of a matrix in 
multivariate analysis, Biometrika, 54, 189-194. [8.6] 

Pillai, K. C. S., and A. K. Gupta (1969), On the exact distribution of Wilks’ criterion, 
Biometrika, 56, 109-118. [8.4] 

Pillai, K. C. S., and Y. S. Hsu (1979), Exact robustness studies of the test of 
independence based on four multivariate criteria and their distribution problems 
under violations, Annals of the Institute of Statistical Mathematics, 31, Part A, 
85-101. [8.6] 

Pillai, K. C. S., and K. Jayachandran (1967), Power comparisons of tests of two 
multivariate hypotheses based on four criteria, Biometrika, 54, 195-210. [8.6] 


Pillai, K. C. S., and K. Jayachandran (1970), On the exact distribution of Pillai’s V^(s) 
criterion, Journal of the American Statistical Association, 65, 447-454. [8.6] 

Pillai, K. C. S., and T. A. Mijares (1959), On the moments of the trace of a matrix and 
approximations to its distribution, Annals of Mathematical Statistics, 30, 
1135-1140. [8.6] 

Pillai, K. C. S., and B. N. Nagarsenker (1971), On the distribution of the sphericity 
test criterion in classical and complex normal populations having unknown 
covariance matrices, Annals of Mathematical Statistics, 42, 764-767. [10.7] 

Pillai, K. C. S., and P. Samson, Jr. (1959), On Hotelling’s generalization of T², 
Biometrika, 46, 160-168. [8.6] 


Pillai, K. C. S., and T. Sugiyama (1969), Non-central distributions of the largest latent 
roots of three matrices in multivariate analysis, Annals of the Institute of Statistical 
Mathematics, 21, 321-327. [8.6] 


Pillai, K. C. S., and D. L. Young (1971), On the exact distribution of Hotelling's 
generalized T₀², Journal of Multivariate Analysis, 1, 90-107. [8.6] 

Plana, G. A. A. (1813), Mémoire sur divers problèmes de probabilité, Mémoires de 
l'Académie Impériale de Turin, pour les Années 1811-1812, 20, 355-408. [1.2] 

Polya, G. (1949), Remarks on computing the probability integral in one and two 
dimensions, Proceedings of the Berkeley Symposium on Mathematical Statistics and 
Probability (J. Neyman, ed.), 63-78. [2.P] 

Rao, C. R. (1948a), The utilization of multiple measurements in problems of biological 
classification, Journal of the Royal Statistical Society B, 10, 159-193. [6.9] 

Rao, C. R. (1948b), Tests of significance in multivariate analysis, Biometrika, 35, 
58-79. [5.3] 

Rao, C. Radhakrishna (1951), An asymptotic expansion of the distribution of Wilks’s 
criterion, Bulletin of the International Statistical Institute, 33, Part 2, 177-180. [8.5] 


Rao, C. R. (1952), Advanced Statistical Methods in Biometric Research, John Wiley & 
Sons, New York. [12.5] 


Rao, C. R. (1973), Linear Statistical Inference and Its Applications (2nd ed.), John 
Wiley & Sons, New York. [4.2] 


Rasch, G. (1948), A functional equation for Wishart’s distribution, Annals of Mathe- 
matical Statistics, 19, 262-266. [7.2] 


Reiersøl, Olav (1950), On the identifiability of parameters in Thurstone’s multiple 
factor analysis, Psychometrika, 15, 121-149. [14.2] 

Reinsel, G. C., and R. P. Velu (1998), Multivariate Reduced-rank Regression, Springer, 
New York. [12.7] 


Richardson, D. H. (1968), The exact distribution of a structural coefficient estimator, 
Journal of the American Statistical Association, 63, 1214-1226. [12.7] 


Rothenberg, Thomas J. (1977), Edgeworth expansions for multivariate test statistics, 
IP-255, Center for Research in Management Science, University of California, 
Berkeley. [8.6] 

Roy, S. N. (1939), p-statistics or some generalisations in analysis of variance 
appropriate to multivariate problems, Sankhya, 4, 381-396. [13.2] 

Roy, S. N. (1945), The individual sampling distribution of the maximum, the minimum 
and any intermediate of the p-statistics on the null-hypothesis, Sankhya, 7, 
133-158. [8.6] 


Roy, S. N. (1953), On a heuristic method of test construction and its use in 
multivariate analysis, Annals of Mathematical Statistics, 24, 220-238. [8.6, 10.6] 

Roy, S. N. (1957), Some Aspects of Multivariate Analysis, John Wiley & Sons, New 
York. [8.6, 10.6, 10.8] 

Ruben, Harold (1966), Some new results on the distribution of the sample correlation 
coefficient, Journal of the Royal Statistical Society B, 28, 513-525. [4.2] 

Rubin, Donald B., and Dorothy T. Thayer (1982), EM algorithms for ML factor 
analysis, Psychometrika, 47, 69-76. [14.3] 

Ryan, D. A. J., J. J. Hubert, E. M. Carter, J. B. Sprague, and J. Parrot (1992), A 
reduced-rank multivariate regression approach to joint toxicity experiments, 
Biometrics, 48, 155-162. [12.7] 

Šalaevskiĭ, O. V. (1968), The minimax character of Hotelling’s T² test (Russian), 
Doklady Akademii Nauk SSSR, 180, 1048-1050. [5.6] 

Šalaevskiĭ (Shalaevskii), O. V. (1971), Minimax character of Hotelling’s T² test. I, 
Investigations in Classical Problems of Probability Theory and Mathematical 
Statistics, V. M. Kalinin and O. V. Shalaevskii (Seminar in Mathematics, V. I. Steklov 
Institute, Leningrad, Vol. 13), Consultants Bureau, New York. [5.6] 

Sargan, J. D., and W. M. Mikhail (1971), A general approximation to the distribution 
of instrumental variables estimates, Econometrica, 39, 131-169. [12.7] 

Sawa, Takamitsu (1969), The exact sampling distribution of ordinary least squares and 
two-stage least squares estimators, Journal of the American Statistical Association, 
64, 923-937. [12.7] 

Schatzoff, M. (1966a), Exact distributions of Wilks’s likelihood ratio criterion, 
Biometrika, 53, 347-358. [8.4] 

Schatzoff, M. (1966b), Sensitivity comparisons among tests of the general linear 
hypothesis, Journal of the American Statistical Association, 61, 415-435. [8.6] 

Scheffé, Henry (1942), On the ratio of the variances of two normal populations, 
Annals of Mathematical Statistics, 13, 371-388. [10.2] 

Scheffé, Henry (1943), On solutions of the Behrens-Fisher problem, based on the 
t-distribution, Annals of Mathematical Statistics, 14, 35-44. [5.5] 

Schmidli, H. (1996), Reduced-rank Regression, Physica, Berlin. [12.7] 

Schuurmann, F. J., P. R. Krishnaiah, and A. K. Chattopadhyay (1975), Exact percentage 
points of the distribution of the trace of a multivariate beta matrix, Journal of 
Statistical Computation and Simulation, 3, 331-343. [8.6] 

Schuurmann, F. J., and V. B. Waikar (1973), Tables for the power function of Roy's 
two-sided test for testing hypothesis Σ = I in the bivariate case, Communications 
in Statistics, 1, 271-280. [10.8] 

Schuurmann, F. J., V. B. Waikar, and P. R. Krishnaiah (1975), Percentage points of 
the joint distribution of the extreme roots of the random matrix S₁(S₁ + S₂)⁻¹, 
Journal of Statistical Computation and Simulation, 2, 17-38. [10.6] 

Schwartz, R. (1967), Admissible tests in multivariate analysis of variance, Annals of 
Mathematical Statistics, 38, 698-710. [8.10] 

Serfling, Robert J. (1980), Approximation Theorems of Mathematical Statistics, John 
Wiley & Sons, New York. [4.2] 


Simaika, J. B. (1941), On an optimum property of two important statistical tests, 
Biometrika, 32, 70-80. [5.6] 




Siotani, Minoru (1980), Asymptotic approximations to the conditional distributions of 
the classification statistic Z and its studentized form Z*, Tamkang Journal of 
Mathematics, 11, 19-32. [6.6] 

Siotani, M., and R. H. Wang (1975), Further expansion formulae for error rates and 
comparison of the W- and the Z-procedures in discriminant analysis, Technical 
Report No. 33, Department of Statistics, Kansas State University, Manhattan, 
Kansas. [6.6] 


Siotani, M., and R. H. Wang (1977), Asymptotic Expansions for Error Rates and 
Comparison of the W-Procedure and the Z-Procedure in Discriminant Analysis, 
Multivariate Analysis IV, North-Holland, Amsterdam, 523-545. [6.6] 

Sitgreaves, Rosedith (1952), On the distribution of two random matrices used in 
classification procedures, Annals of Mathematical Statistics, 23, 263-270. [6.5] 

Solari, M. E. (1969), The “maximum likelihood solution” of the problem of estimating 
a linear functional relationship, Journal of the Royal Statistical Society B, 31, 
372-375. [14.4] 

Spearman, Charles (1904), “General-intelligence,” objectively determined and mea- 
sured, American Journal of Psychology, 15, 201-293. [14.2] 

Speed, T. P., and H. Kiiveri (1986), Gaussian Markov distributions over finite graphs, 
Annals of Statistics, 14, 138-150. [15.5] 

Srivastava, M. S., and C. G. Khatri (1979), An Introduction to Multivariate Statistics, 
North-Holland, New York. [10.9, 13.P] 


Stein, C. (1956a), The admissibility of Hotelling’s T²-test, Annals of Mathematical 
Statistics, 27, 616-623. [5.6] 

Stein, C. (1956b), Inadmissibility of the usual estimator for the mean of a multivariate 
normal distribution, Proceedings of the Third Berkeley Symposium on Mathematical 
Statistics and Probability (Jerzy Neyman, ed.), Vol. I, 197-206, University of 
California, Berkeley. [3.5] 

Stein, C. (1974), Estimation of the parameters of a multivariate normal distribution. I. 
Estimation of the means, Technical Report No. 63, NSF Contract GP 30711X-2, 
Department of Statistics, Stanford University. [3.5] 


Stoica, P., and M. Viberg (1996), Maximum likelihood parameter and rank estimation 
in reduced-rank multivariate linear regressions, IEEE Transactions on Signal 
Processing, 44, 3069-3078. [12.7] 

Student (W. S. Gosset) (1908), The probable error of a mean, Biometrika, 6, 1-25. 
[3.2] 


Styan, George P. H. (1990), The Collected Papers of T. W. Anderson: 1943-1985, John 
Wiley & Sons, Inc., New York. [Preface] 

Subrahmaniam, Kocherlota, and Kathleen Subrahmaniam (1973), Multivariate Analysis: 
A Selected and Abstracted Bibliography, 1957-1972, Marcel Dekker, New York. 
[Preface] 

Sugiura, Nariaki, and Hisao Nagao (1968), Unbiasedness of some test criteria for the 
equality of one or two covariance matrices, Annals of Mathematical Statistics, 39, 
1686-1692. [10.8] 

Sugiyama, T. (1967), Distribution of the largest latent root and the smallest latent 
root of the generalized B statistic and F statistic in multivariate analysis, Annals 
of Mathematical Statistics, 38, 1152-1159. [8.6] 




Sugiyama, T., and K. Fukutomi (1966), On the distribution of the extreme characteristic 
roots of the matrices in multivariate analysis, Reports of Statistical Application 
Research, Union of Japanese Scientists and Engineers, 13. [8.6] 

Sverdrup, Erling (1947), Derivation of the Wishart distribution of the second order 
sample moments by straightforward integration of a multiple integral, Skandi- 
navisk Aktuarietidskrift, 30, 151-166. [7.2] 

Tang, P. C. (1938), The power function of the analysis of variance tests with tables 
and illustrations of their use, Statistical Research Memoirs, 2, 126-157. [5.4] 

Theil, H. (assisted by J. S. Cramer, H. Moerman, and A. Russchen) (1961), Economic 
Forecasts and Policy (2nd rev. ed.), North-Holland, Amsterdam, Contributions to 
Economic Analysis No. XV (first published 1958). [12.8] 

Thomson, Godfrey H. (1934), Hotelling’s method modified to give Spearman’s “g,” 
Journal of Educational Psychology, 25, 366-374. [14.3] 

Thomson, Godfrey H. (1951), The Factorial Analysis of Human Ability (5th ed.), 
University of London, London. [14.7] 

Thurstone, L. L. (1947), Multiple-Factor Analysis, University of Chicago, Chicago. 
[14.2, 14.5] 

Tsay, R. S., and G. C. Tiao (1985), Use of canonical analysis in time series model 
identification, Biometrika, 72, 299-315. [12.7] 

Tukey, J. W. (1949), Dyadic anova, an analysis of variance for vectors, Human Biology, 
21, 65-110. [8.9] 

Tyler, David E. (1981), Asymptotic inference for eigenvectors, Annals of Statistics, 9, 
725-745. [11.7] 

Tyler, David E. (1982), Radial estimates and the test for sphericity, Biometrika, 69, 
429-436. [3.6] 

Tyler, David E. (1983a), Robustness and efficiency properties of scatter matrices, 
Biometrika, 70, 411-420. [3.6] 

Tyler, David E. (1983b), The asymptotic distribution of principal component roots 
under local alternatives to multiple roots, Annals of Statistics, 11, 1232-1242. 
[11.7] 

Tyler, David E. (1987), A distribution free M-estimator of multivariate scatter, Annals 
of Statistics, 15, 234-251. [3.6] 

Velu, R. P., G. C. Reinsel, and D. W. Wichern (1986), Reduced rank models for 
multiple time series, Biometrika, 73, 105-118. [12.7] 

von Mises, R. (1945), On the classification of observation data into distinct groups, 
Annals of Mathematical Statistics, 16, 68-73. [6.8] 

von Neumann, J. (1937), Some matrix-inequalities and metrization of matric-space, 
Tomsk University Review, 1, 286-300. Reprinted in John von Neumann Collected 
Works (A. H. Taub, ed.), 4 (1962), Pergamon, New York, 205-219. [A.4] 

Wald, A. (1943), Tests of statistical hypotheses concerning several parameters when 
the number of observations is large, Transactions of the American Mathematical 
Society, 54, 426-482. [4.2] 

Wald, A. (1944), On a statistical problem arising in the classification of an individual 
into one of two groups, Annals of Mathematical Statistics, 15, 145-162. [6.4, 6.5] 

Wald, A. (1950), Statistical Decision Functions, John Wiley & Sons, New York. [6.2, 
6.7, 8.10] 




Wald, A., and R. Brookner (1941), On the distribution of Wilks’ statistic for testing 
the independence of several groups of variates, Annals of Mathematical Statistics, 
12, 137-152. [8.4, 9.3] 

Walker, Helen M. (1931), Studies in the History of Statistical Method, Williams and 
Wilkins, Baltimore. [1.1] 


Welch, P. D., and R. S. Wimpress (1961), Two multivariate statistical computer 
programs and their application to the vowel recognition problem, Journal of the 
Acoustical Society of America, 33, 426-434. [6.10] 

Wermuth, N. (1980), Linear recursive equations, covariance selection and path 
analysis, Journal of the American Statistical Association, 75, 963-972. [15.5] 

Whittaker, E. T., and G. N. Watson (1943), A Course of Modern Analysis, Cambridge 
University, Cambridge. [8.5] 

Whittaker, Joe (1990), Graphical Models in Applied Multivariate Statistics, John Wiley 
& Sons, Inc., Chichester. [15.1] 

Wijsman, Robert A. (1979), Constructing all smallest simultaneous confidence sets in 
a given class, with applications to MANOVA, Annals of Statistics, 7, 1003-1018. 
[3.7] 

Wijsman, Robert A. (1980), Smallest simultaneous confidence sets with applications 
in multivariate analysis, Multivariate Analysis, V, 483-498. [8.7] 

Wilkinson, James Hardy (1965), The Algebraic Eigenvalue Problem, Clarendon, Oxford. 
[11.4] 

Wilkinson, J. H., and C. Reinsch (1971), Linear Algebra, Springer-Verlag, New York. 
[11.4] 

Wilks, S. S. (1932), Certain generalizations in the analysis of variance, Biometrika, 24, 
471-494. [7.5, 8.3, 10.4] 

Wilks, S. S. (1934), Moment-generating operators for determinants of product 
moments in samples from a normal system, Annals of Mathematics, 35, 312-340. 
[8.3] 

Wilks, S. S. (1935), On the independence of k sets of normally distributed statistical 
variables, Econometrica, 3, 309-326. [8.4, 9.3, 9.P] 

Wishart, John (1928), The generalised product moment distribution in samples from a 
normal multivariate population, Biometrika, 20A, 32-52. [7.2] 

Wishart, John (1948), Proofs of the distribution law of the second order moment 
statistics, Biometrika, 35, 55-57. [7.2] 

Wishart, John, and M. S. Bartlett (1933), The generalised product moment 
distribution in a normal system, Proceedings of the Cambridge Philosophical Society, 29, 
260-270. [7.2] 

Wold, H. D. A. (1954), Causality and econometrics, Econometrica, 22, 162-177. [15.1] 

Wold, H. D. A. (1960), A generalization of causal chain models, Econometrica, 28, 
443-463. [15.1] 

Woltz, W. G., W. A. Reid, and W. E. Colwell (1948), Sugar and nicotine in cured 
bright tobacco as related to mineral element composition, Proceedings of the Soil 
Sciences Society of America, 13, 385-387. [8.P] 


Wright, Sewall (1921), Correlation and causation, Journal of Agricultural Research, 20, 
557-585. [15.1] 




Wright, Sewall (1934), The method of path coefficients, Annals of Mathematical 
Statistics, 5, 161-215. [15.1] 

Yamauti, Ziro (1977), Concise Statistical Tables, Japanese Standards Association. 
[Preface] 

Yates, F., and W. G. Cochran (1938), The analysis of groups of experiments, Journal 
of Agricultural Science, 28, 556. [8.9] 

Yule, G. U. (1897a), On the significance of Bravais’ formulae for regression & c., in 
the case of skew correlation, Proceedings of the Royal Society of London, 60, 
477-489. [2.5] 

Yule, G. U. (1897b), On the theory of correlation, Journal of the Royal Statistical 
Society, 60, 812-854. [2.5] 

Zehna, P. W. (1966), Invariance of maximum likelihood estimators, Annals of 
Mathematical Statistics, 37, 744. [3.2] 


Index 


Absolutely continuous distribution, 8 
Additivity of Wishart matrices, 259 
Admissible, definition of, 88, 210, 235 
Admissible procedures, 235 
Admissible test, definition of, 193 

Stein's theorem for, 194 
Almost invariant test, 192 
Analysis of variance, random effects model, 429 
likelihood ratio test of, 431 
See also Multivariate analysis of variance 
Anderson, Edgar, Iris data of, 111 
A posteriori density, 89 
of μ, 89 
of μ and Σ, 274 
of μ, given x̄ and S, 275 
of Σ, 273 
Asymptotic distribution of a function, 132 
Asymptotic expansions of distributions of likelihood ratio criteria, 321 
of gamma function, 318 


Barley yields in two years, 349 
Bartlett decomposition, 257 
Bartlett-Nanda-Pillai trace criterion, see 
Linear hypothesis 
Bayes estimator of covariance matrix, 273 
Bayes estimator of mean vector, 90 
Bayes procedure, 89 
extended, 90 
Bayes risk, 89 
Bernoulli polynomials, 318 
Best estimator of covariance matrix 
invariant with respect to triangular linear 
transformations, 279, 281 
proportional to sample covariance matrix, 277 


Best linear predictor, 37, 497 
and predictand, 497 
See also Canonical correlations and variates 
Best linear unbiased estimator, 298 
Beta distribution, 177 
Bhattacharya’s estimator of the mean, 99 
Bivariate normal density, 21, 35 
distribution, 21, 35 
computation of, 23 
Bootstrap method, 135 


𝒞, 17 
Canonical analysis of regression coefficients, 508 
sample, 510 
Canonical correlations and variates, 487, 495 
asymptotic distribution of sample correlations, 505 
computation of, 501 
distribution of sample, 545 
invariance of, 496 
maximum likelihood estimators of, 501 
sample, 500 
testing number of nonzero correlations, 504 
use in testing hypotheses of rank of 
covariance matrix, 504 
Causal chain, 605 
Central limit theorem, multivariate, 86 
Characteristic function, 41 
continuity theorem for, 45 
inversion of, 45 
of the multivariate normal distribution, 43 
Characteristic roots and vectors, 631, 632 
asymptotic distributions of, 545, 559 
distribution of roots of a symmetric matrix, 542 
distribution of roots of Wishart matrix, 540 


of Wishart matrix in the metric of another, 
529 
asymptotic distribution of, 550 
distribution of, 536 
See also Principal components 
Chi-squared distribution, 286 
noncentral, 82 
Cholesky decomposition, 631 
Classification into normal populations 
Bayes 
into one of several, 236 
into one of two, 204 
discriminant function, 218 
sample, 220 
example, 240 
invariance of procedures, 226 
likelihood criterion for, 224 
unequal covariance matrices, 242 
linear, for unequal covariance matrices, 243 
admissible, 246 
maximum likelihood rule, 226 
minimax 
one of several, 238 
one of two, 218 
one of several, 237 
one of two, 216 
W-statistic, 219 
asymptotic distribution of, 222 
asymptotic expansion of misclassification 
probabilities, 227 
Z-statistic, 226 
asymptotic expansion of misclassification 
probabilities, 231 
See also Classification procedures 
Classification procedures 
admissible, 210, 235 
into several populations, 236 
into two populations, 214 
a priori probabilities, 209 
Bayes, 89, 210 
and admissible, 214, 236 
into several populations, 234 
into two populations, 216 
complete class of, 211, 235 
essentially, 211 
minimal, 211 
costs of misclassification, definition of, 208 
expected loss from misclassification, 210 
minimax, 211 
for two populations, 215 
probability of misclassification, 210, 227 
See also Classification into normal 
populations 




Cochran's theorem, 262 
Coefficient of alienation, 400 
Complete class of procedures, 88 
essentially, 88 
minimal, 88 
Completeness, definition of, 84 
of sample mean and covariance matrix, 
84, 85 
Complex normal distribution, 64 
characteristic function of, 65 
linear transformation in, 65 
maximum likelihood estimators for, 112 
Complex Wishart distribution, 287 
Components of variance, 429 
Concentration ellipsoid, 58, 85 
Conditional density, 12 
normal, 34 
Conditional probability, 12 
Conjugate family of distributions, 272 
Consistency, definition of, 86 
Contours of constant density, 22 
Correlation coefficient 
canonical, see Canonical correlations and 
variates 
confidence interval for, 128 
by use of Fisher’s z, 135 
distribution of sample, asymptotic, 133 
bootstrap method for, 135 
tabulation of, 126 
when population is not zero, 125 
when population is zero, 121 
distribution of set of sample, 272 
Fisher's z, 134 
geometric interpretation of sample, 72 
invariance of population, 21 
invariance of sample, 111 
likelihood ratio test, 130 
maximum likelihood estimator of, 71 
as measure of association, 22 
moments of, 166 
monotone likelihood ratio of, 164 
multiple, see Multiple correlation coefficient 
partial, see Partial correlation coefficient 
in the population (simple, Pearson, 
product-moment, total), 20 
sample (Pearson), 71, 116 
test of equality of two, 135 
test of hypothesis about, 126 
by Fisher’s z, 134 
power of, 128 
test that it is zero, 121 
Cosine of angle between two vectors, 72. See 
also Correlation coefficient 
Covariance, 17 




Covariance matrix, 17 
asymptotic distribution of sample, 86 
Bayes estimator of, 273 
characterization of distribution of sample, 
77 
confidence bounds for quadratic form in, 
442 
consistency of sample as estimator of 
population, 86 
distribution of sample, 255 
estimation, see Best estimator of 
geometrical interpretation of sample, 72 
with linear structure, 113 
maximum likelihood estimator of, 70 
computation of, 70 
when the mean vector is known, 112 
of normal distribution, 20 
sample, 77 
singular, 31 
tests of hypotheses, see Testing that a 
covariance matrix is a given matrix; 
Testing that a covariance matrix is 
proportional to a given matrix; Testing 
that a covariance matrix and mean vector 
are equal to a given matrix and vector; 
Testing equality of covariance matrices; 
Testing equality of covariance matrices 
and mean vectors; Testing independence 
of sets of variates 
unbiased estimator of, 77 
Covariance selection models, 614 
decomposition of, 618 
estimation in, 614 
Cramér-Rao lower bound, 85 
Critical function, 192 
Cumulants, 46 
of a multivariate normal distribution, 46 
Cumulative distribution function (cdf), 7 


Decision procedure, 88 
Degenerate normal distribution, 30 
Density, 7 
conditional, 12 
normal, 34 
marginal, 9 
normal, 27, 29 
multivariate normal, 20 
Determinant, 626 
derivative of, 642 
symmetric matrix, 642 
Dirichlet distribution, 290 
Discriminant function, see Classification into 
normal populations 
Distance, 631 




Distance between two populations, 80 

Distribution, see Canonical correlations; 
Characteristic roots; Correlation 
coefficient; Covariance matrix; 
Cumulative distribution function; Density; 
Generalized variance; Mean vector; 
Multiple correlation coefficient; 
Multivariate normal density; Multivariate 
t-distribution; Noncentral chi-squared 
distribution; Noncentral F-distribution; 
Noncentral T²-distribution; Partial 
correlation coefficient; Principal 
components; Regression coefficients; 
T²-test and statistic; Wishart distribution 

Distribution of matrix of sums of squares and 
cross-products, see Wishart distribution 

Duplication formula for gamma function, 82, 
125, 309 


ℰ, 9 
Efficiency of vector estimate, definition of, 85 
Ellipsoid of concentration of vector estimate, 
58, 85 
Ellipsoid of constant density, 32 
Elliptically contoured distribution, 47 
characteristic function of, 53 
characteristic roots and vectors, asymptotic 
distribution of, 482, 564 
correlation coefficient, asymptotic 
distribution of, 159 
covariance of, 50 
covariance of sample covariance, 101 
covariance of sample mean, 101 
cumulants of, 54 
kurtosis of, 54 
likelihood ratio criterion for equality of 
covariance matrices, asymptotic 
distribution of, 451 
likelihood ratio criterion for independence 
of sets, asymptotic distribution of, 406 
likelihood ratio criterion for linear 
hypotheses, asymptotic distribution of, 371 
maximum likelihood estimator of 
parameters, 104 
multiple correlation coefficient, asymptotic 
distribution of, 159 
rectangular coordinates, asymptotic 
distribution of, 283 
test for regression coefficients, 371 
Elliptically contoured matrix distribution, 104 
characteristic roots and vectors, distribution 
of, 483, 566 
likelihood ratio criterion for equality of 
covariance matrices, distribution of, 454 


likelihood ratio criterion for independence 
of sets, distribution of, 408 
likelihood ratio criterion for linear 
hypotheses, distribution of, 373 
rectangular coordinates, distribution of, 285 
stochastic representation, 160, 285 
sufficient statistics, 160 
T²-distribution of, 200 
Equiangular line, 72 
Exp, 15 
Expected value of complex-valued function, 
41 
Exponential family of distributions, 194 
Extended region, 355 


Factor analysis, 569 
centroid method, 586 
communalities, 581 
confirmatory, 574 
EM algorithm, 580, 593 
exploratory, 574 
general factor, 570 
identification of structure in, 572 
by specified zeros, 571, 593 
loadings, 570 
maximum likelihood estimators, 578 
asymptotic distribution of, 582 
in case of identification by zeros, 590 
nonexistence for fixed factors, 587 
for random factors, 583 
minimum distance methods, 583 
model, 570 
oblique factors, 571 
orthogonal factors, 571 
principal component analysis, relation 
to, 584 
rotation of factors, 571 
scores, estimation of, 591 
simple structure, 573 
space of factors, 572 
transformations, 588 
tests of fit for, 581 
units of measurement, 575 
varimax criterion, modified, 589 
Factorization theorem, 83 
Fisher’s z, 133 
asymptotic distribution of, 134 
moments of, 134 
See also Correlation coefficient; Partial 
correlation coefficient 


Γ_p(t), 257 
Gamma function, multivariate, 257 


Generalized analysis of variance, see 
Multivariate analysis of variance; 
Regression coefficients and function 

Generalized T², see T²-test and statistic 

Generalized variance, 264 

asymptotic distribution of sample, 270 
distribution of sample, 268 
geometric interpretation of sample in N 
dimensions, 267 
in p dimensions, 268 
invariance of, 465 
moments of sample, 269 
sample, 265 

General linear hypothesis, see Linear 
hypothesis, testing of; Regression 
coefficients and function 

Gram-Schmidt orthogonalization, 252, 647 

Graphical models, 595 

adjacent, nonadjacent vertices, 596 
AMP (Andersson-Madigan-Perlman) 
Markov chain, 612 
ancestor, 605 
boundary, 598 
chain graph, 630 
child, 605 
clique, 602 
closure, 598 
complete, 602 
decomposition, 603 
descendant, nondescendant, 605 
edges, 595 
directed, undirected, 596 
LWF (Lauritzen-Wermuth-Frydenberg) 
Markov chain, 610 
Markov properties, 597 
globally, 600 
locally, 598 
pairwise, 597 
moral graph, 608 
nodes, 595 
parent, 605 
partial ordering, 605 
path, 600 
recursive factorization, 60% 
separate, 600 
vertices, 595 
well-numbered, 607 


Haar invariant distribution of orthogonal 
matrices, 162, 541 
conditional distribution, 542 
Hadamard’s inequality, 61 
Head lengths and breadths of brothers, 109 
Hotelling’s T², see T²-test and statistic 
Hypergeometric function, 126 




Incomplete beta function, 329 
Independence, 10 
mutual, 11 
of normal variables, 26 
of sample mean vector and sample 
covariance matrix, 77 
tests of, see Correlation coefficient; Multiple 
correlation coefficient; Testing 
independence of sets of variates 
Information matrix, 85 
Integral of a symmetric unimodal function over 
a symmetric convex set, 365 
Intraclass correlation, 484 
Invariance, see Classification into normal 
populations; Correlation coefficient; 
Generalized variance; Linear hypothesis; 
Multiple correlation coefficient; Partial 
correlation coefficient; T²-test; Testing 
that a covariance matrix is a given matrix; 
Testing that a covariance matrix is 
proportional to a given matrix; Testing 
equality of covariance matrices; Testing 
equality of covariance matrices and mean 
vectors; Testing independence of sets of 
variates 
Inverted Wishart distribution, 272 
Iris, four measurements on, 110, 180 


Jacobian, 13 

James-Stein estimator, 91 
for arbitrary known covariance matrix, 97 
average mean squared error of, 95 


Kronecker delta, 75 

Kronecker product of matrices, 643 
characteristic roots of, 643 
determinant of, 643 

Kurtosis, 54 
estimation of, 103 


Latin square, 377 

Lawley-Hotelling trace criterion, see Linear 
hypothesis 

Least squares estimator, 295 

Likelihood, induced, 71 

Likelihood function for sample from 
multivariate normal distribution, 67 

Likelihood loss function for covariance matrix, 
276 

Likelihood ratio test, definition of, 129. See 
also Correlation coefficient; Linear 
hypothesis; Mean vector; Multiple 
correlation coefficient; Regression 
coefficients; T²-test; Testing that a 
covariance matrix is a given matrix; Testing 
that a covariance matrix is proportional 
to a given matrix; Testing that a covariance 
matrix and mean vector are equal to a 
given matrix and vector; Testing equality 
of covariance matrices; Testing equality 
of covariance matrices and mean vectors; 
Testing independence of sets of variates 
Linear combinations of normal variables, 
distribution of, 29 
Linear equations, solution of, 606 
by Gaussian elimination, 607 
Linear functional relationship, 513
relation to simultaneous equations, 520 
Linear hypothesis, testing of 
admissibility of, 353 
necessary condition for, 363 
Bartlett-Nanda-Pillai trace criterion, 331 
admissibility of, 379 
asymptotic expansion of distribution 
of, 333 
as Bayes procedure, 378 
table of significance points of, 673 
tabulation of power of, 333 
canonical form of, 303 
comparison of powers, 334 
invariance of criteria, 327 
Lawley-Hotelling trace criterion, 328 
admissibility of, 379 
asymptotic expansion of distribution 
of, 330 
monotonicity of power function of, 368 
table of significance points of, 657 
tabulation of, 328 
likelihood ratio criterion, 300
admissibility of, 378 
asymptotic expansion of distribution 
of, 321 
as Bayes procedure, 378 
distributions of, 306, 310 
F-approximation to distribution of, 326 
geometric interpretation of, 302
moments of, 309 
monotonicity of power function of, 268 
normal approximation to distribution 
of, 323 
table of significance points, 651 
tabulation of distribution of, 314 
Wilks' Λ, 300
monotonicity of power function of an 
invariant test, 363 
Roy’s maximum root criterion, 333 
distribution for p = 2, 334 
monotonicity of power function of, 368
table of significance points, 677
tabulation of distribution of, 333
step-down test, 314
See also Regression coefficients and function
Linearly independent vectors, 627 
Linear transformation of a normal vector, 23,
29, 31
Loss, 88
LR decomposition, 630 


Mahalanobis distance, 80, 217 
sample, 228
Majorization, 355
weak, 355
Marginal density, 9
distribution, 9
normal, 27 
Mathematical expectation, 9 
Matrix, 624 
bidiagonal, upper, 503 
characteristic roots and vectors of, see
Characteristic roots and vectors 
cofactor in, 627 
convexity, 358 
definition of, 624
diagonalization of symmetric, 631
doubly stochastic, 646
eigenvalue, see Characteristic roots and
vectors 
Givens, 471, 649 
Householder, 470, 650
idempotent, 635 
identity, 626 
inverse, 627 
minor of, 627 
nonsingular, 627 
operations with, 625 
positive definite, 628 
positive semidefinite, 628 
rank of, 628 
symmetric, 626 
trace of, 629 
transpose, 625 
triangular, 629 
tridiagonal, 470 
Matrix of sums of squares and cross-products 
of deviations from the means, 68 
Maximum likelihood estimators, see Canonical
correlations and variates; Correlation
coefficient; Covariance matrix; Mean
vector; Multiple correlation coefficient;
Partial correlation coefficient; Principal
components; Regression coefficients;
Variance
Maximum likelihood estimator of function of
parameters, 71

Maximum of the likelihood function, 70 
Maximum of variance of linear combinations, 
464. See also Principal components 

Mean vector, 17 
asymptotic normality of sample, 86 
completeness of sample as an estimator of
population, 84
confidence region for difference of two
when common covariance matrix is
known, 80
when covariance matrix is unknown,
180 
consistency of sample as estimate of 
population, 86 
distribution of sample, 76 
efficiency of sample, 85 
improved estimator when covariance matrix
is unknown, 185 
maximum likelihood estimator of, 70 
sample, 67 
simultaneous confidence regions for linear 
functions of, 178 
testing equality of, in several distributions, 
206 
testing equality of two when common 
covariance matrix is known, 80
tests of hypothesis about
when covariance matrix is known, 80
when covariance matrix is unknown, see
T²-test
See also James-Stein estimator 
Minimax, 90 
Missing observations, maximum likelihood
estimators, 168 
Modulus, 13 
Moments, 9, 41 
factoring of, 11
from marginal distributions, 10 
of normal distributions, 46 
Monotone region, 355 
in majorization, 355 
Multiple correlation coefficient 
adjusted, 153 
distribution of sample 
conditional, 154 
when population correlation is not 
zero, 156 
when population correlation is zero, 
150 
geometric interpretation of sample, 148 
invariance of population, 60 
invariance of sample, 166 
likelihood ratio test that it is zero, 151
as maximum correlation between one
variable and linear combination of other
variables, 38
maximum likelihood estimator of, 147
moments of sample, 156 
optimal properties of, 157 
population, 38 
sample, 145 
tabulation of distribution of, 1/7 
Multivariate analysis of variance (MANOVA), 
346 
Latin square, 377 
one-way, 342 
two-way, 346 
See also Linear hypothesis, testing of
Multivariate beta distribution, 377
Multivariate gamma function, 257
Multivariate normal density, 20 
distribution, 20 
computation of, 23 
Multivariate t-distribution, 276, 289


n(x|μ, Σ), 20
N(μ, Σ), 20
Neyman-Pearson fundamental lemma, 248
Noncentral chi-squared distribution, 82 
Noncentral F-distribution, 186 

tables of, 186 
Noncentral T²-distribution, 186


O(N × p), 161
Orthonormal vectors, 647 


Parallelotope, 266 
volume of, 266 

Partial correlation coefficient 
computational formulas for, 39, 40, 41 
confidence intervals for, 143 
distribution of sample, 143 
geometric interpretation of sample, 138 
invariance of population, 63 
invariance of sample, 166
maximum likelihood estimator of, 138 
in the population, 35 
recursion formula for, 41 
sample, 138 
tests about, 144 

Partial covariance, 34
estimator of, 137 

Partial variance, 34 

Partitioning of a matrix, 635
addition of, 635
of a covariance matrix, 25
determinant of, 637
inverse of, 638 
multiplication of, 635 
Partitioning of a vector, 635
of a mean vector, 25 
of a random vector, 24 
Path analysis, 596. See also Graphical models
Pearson correlation coefficient, see 
Correlation coefficient 
Plane of closest fit, 466 
Polar coordinates, 285 
Positive definite matrix, 628 
Positive part of a function, 96 
of the James-Stein estimator, 97
Positive semidefinite matrix, 628 
Precision matrix, 272 
unbiased estimator of, 274 
Principal axes of ellipsoids of constant density, 
465. See also Principal components 
Principal components, 459
asymptotic distribution of sample, 473 
computation of, 469 
confidence region for, 475, 477 
distribution of sample, 540, 542 
maximum likelihood estimator of, 467 
population, 464 
testing hypotheses about, 478, 479, 480 
Probability element, 8 
Product-moment correlation coefficient, see 
Correlation coefficient 


QL algorithm, 471
QR algorithm, 471 
decomposition, 647 
Quadratic form, 628 
Quadratic loss function for covariance matrix, 
376 


r, 71 

ℜ (real part), 257

Random matrix, 16 
expected value of, 17

Random vector, 16 

Randomized test, definition of, 192 

Rectangular coordinates, 257 
distribution of, 255, 257 

Reduced rank regression, 514
estimator, asymptotic distribution of, 550

Regression coefficients and function, 34
confidence regions for, 339 
distribution of sample, 297 
geometric interpretation of sample, 138
maximum likelihood estimator of, 294 
partial correlation, connection with, 61 
sample, 294
simultaneous confidence intervals for, 340, 
341 
testing hypotheses of rank of, 512 
testing they are zero, in case of one 
dependent variable, 152 
Residuals from regression, 37 
Risk function, 88 


Selection of linear combinations, 201 
Simple correlation coefficient, see Correlation
coefficient 
Simultaneous equations, 513 
estimation of coefficients, 518 
least variance ratio (LVR), 519 
limited information maximum likelihood 
(LIML), 519 
two-stage least squares (TSLS), 522
identification by zeros, 516 
reduced form, 516 
estimation of, 517 
Singular normal distribution, 30, 31 
Singular value decomposition, 498, 634 
Spherical distribution, 105 
left, 105 
right, 105
vector, 105 
Spherical normal distribution, 23 
Spherically contoured distribution, 47
stochastic representation, 49 
uniform distribution, 48 
Sphericity test, see Testing that a covariance 
matrix is proportional to a given matrix 
Standardized sum statistics, 201 
Standardized variable, 22 
Stiefel manifold, 162
Stochastic convergence, 113 
of a sequence of random matrices, 113 
Sufficiency, definition of, 83 
Sufficiency of sample mean vector and
covariance matrix, 83 
Surface area of unit sphere, 286 
Surfaces of constant density, 22
Symmetric matrix, 626 


T²-statistic, 176. See also T²-test and statistic
T²-test and statistic, 173
admissibility of, 196 
as Bayes procedure, 199 
distribution of statistic, 176 
geometric interpretation of statistic, 174
invariance of, 173
as likelihood ratio test of mean vector, 176
limiting distribution of, 176 
noncentral distribution of statistic, 186
optimal properties of, 190 
power of, 186 
tables of, 186 
for testing equality of means when 
covariance matrices are different, 187 
for testing equality of two mean vectors 
when covariance matrix is unknown, 179 
for testing symmetry in mean vector, 182 
as uniformly most powerful invariant test of 
mean vector, 191 
Testing that a covariance matrix is a given
matrix, 438
invariant tests of, 442 
likelihood ratio criterion for, 438 
modified likelihood ratio criterion for, 438 
asymptotic expansion of distribution of, 
442 
moments of, 440 
table of significance points, 685 
Nagao’s criterion, 442 
Testing that a covariance matrix is 
proportional to a given matrix, 431 
invariant tests, 436
likelihood ratio criterion for, 434 
admissibility, 458 
asymptotic expansion of distribution of, 
435 
moments of, 434 
table of significance points, 682 
Nagao's criterion, 437 
Testing that a covariance matrix and mean 
vector are equal to a given matrix and 
vector, 444 
likelihood ratio criterion for, 444 
asymptotic expansion of distribution of, 
446 
moments of, 445
Testing equality of covariance matrices, 412
invariant tests, 428
likelihood ratio criterion for, 413
invariance of, 414 
modified likelihood ratio criterion for, 413 
admissibility of, 449 
asymptotic expansion of distribution of, 
425 
distribution of, 420 
moments of, 422 
table of significance points, 681
Nagao’s criterion for, 415 
Testing equality of covariance matrices and 
mean vectors, 415 
likelihood ratio criterion for, 415 
asymptotic expansion of distribution of, 
426 
distribution of, 421
moments of, 422
unbiasedness of, 416 
Testing independence of sets of variates, 
381 
and canonical correlations, 504 
likelihood ratio criterion for, 384 
admissibility of, 401 
asymptotic expansion of distribution of, 
390 
distribution of, 388 
invariance of, 386 
moments of, 388 
monotonicity of power function of, 404 
unbiasedness of, 386 
Nagao’s test, 392 
asymptotic expansion of distribution of, 
392 
stepdown tests, 393 
Testing rank of regression matrices, 512 
Tests of hypotheses, see Correlation
coefficient; Generalized analysis of
variance; Linear hypothesis; Mean vector;
Multiple correlation coefficient; Partial
correlation coefficient; Regression
coefficients; T²-test and statistic
Tetrachoric functions, 23
Total correlation coefficient, see Correlation 
coefficient 

Trace of a matrix, 629

Transformation of variables, 12 


Unbiased estimator, definition of, 77

Unbiased test, definition of, 364

Uniform distribution on unit sphere, 48 
on O(N × p), 162


Variance, 17 
generalized, see Generalized variance 
maximum likelihood estimator of, 71 


w(A|Σ, n), 255

W(Σ, n), 255

w⁻¹(B|Ψ, m), 272

W⁻¹(Ψ, m), 272

Wishart distribution, 256 
characteristic function of, 258 
geometric interpretation of, 256
marginal distributions of, 260 
noncentral, 587 
for p = 2, 124 


z, see Fisher's z 
Zonal polynomials, 473 


